没有合适的资源?快使用搜索试试~ 我知道了~
首页机器学习入门 kaggle房价预测 精讲(含代码)
资源详情
资源推荐
GettingtartedwithKaggle:
HouePriceCompetition
AdamMaachi 05MAY2017intutorial,pthon,andkaggle
Founded in 2010, Kaggle is a Data Science platform where users can
share, collaborate, and compete. One key feature of Kaggle is “Compe-
titions”, which oers users the ability to practice on real world data
and to test their skills with, and against, an international community.
This guide will teach you how to approach and enter a Kaggle competi-
tion, including exploring the data, creating and engineering features,
building models, and submitting predictions. We’ll use Python 3 and
Jupyter Notebook.
TheCompetition
We’ll work through the House Prices: Advanced Regression Techniques
competition.
We’ll follow these steps to a successful Kaggle Competition submission:
Acquire the data
Explore the data
Engineer and transform the features and the target variable
Build a model
Make and submit predictions
tep1:Acquirethedataandcreateour
environment
We need to acquire the data for the competition. The descriptions of the
features and some other helpful information are contained in a le with
an obvious name, data_description.txt .
Download the data and save it into a folder where you’ll keep every-
thing you need for the competition.
We will rst look at the train.csv data. After we’ve trained a model,
we’ll make predictions using the test.csv data.
First, import Pandas, a fantastic library for working with data in
Python. Next we’ll import Numpy.
We can use Pandas to read in csv les. The pd.read_csv() method cre-
ates a DataFrame from a csv le.
Let’s check out the size of the data.
We see that test has only 80 columns, while train has 81. This is due
to, of course, the fact that the test data do not include the nal sale
price information!
import pandas as pd
import numpy as np
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
print ("Train data shape:", train.shape)
print ("Test data shape:", test.shape)
Train data shape: (1460, 81)
Test data shape: (1459, 80)
Next, we’ll look at a few rows using the DataFrame.head() method.
We should have the data dictionary available in our folder for the
competition. You can also nd it here.
Here’s a brief version of what you’ll nd in the data description le:
SalePrice - the property’s sale price in dollars. This is the target
variable that you’re trying to predict.
MSSubClass - The building class
MSZoning - The general zoning classication
LotFrontage - Linear feet of street connected to property
LotArea - Lot size in square feet
Street - Type of road access
Alley - Type of alley access
LotShape - General shape of property
LandContour - Flatness of the property
Utilities - Type of utilities available
LotConfig - Lot conguration
train.head()
Id MSSubClass MSZoning LotFrontage LotArea Street Alley
0 1 60 RL 65.0 8450 Pave NaN
1 2 20 RL 80.0 9600 Pave NaN
2 3 60 RL 68.0 11250 Pave NaN
3 4 70 RL 60.0 9550 Pave NaN
4 5 60 RL 84.0 14260 Pave NaN
And so on.
The competition challenges you to predict the nal price of each home.
At this point, we should start to think about what we know about hous-
ing prices, Ames, Iowa, and what we might expect to see in this dataset.
Looking at the data, we see features we expected, like YrSold (the year
the home was last sold) and SalePrice . Others we might not have an-
ticipated, such as LandSlope (the slope of the land the home is built
upon) and RoofMatl (the materials used to construct the roof). Later,
we’ll have to make decisions about how we’ll approach these and other
features.
We want to do some plotting during the exploration stage of our
project, and we’ll need to import that functionality into our environ-
ment as well. Plotting allows us to visualize the distribution of the data,
check for outliers, and see other patterns that we might miss otherwise.
We’ll use Matplotlib, a popular visualization library.
tep2:xplorethedataandengineer
Feature
The challenge is to predict the nal sale price of the homes. This infor-
mation is stored in the SalePrice column. The value we are trying to
predict is often called the target variable.
We can use Series.describe() to get more information.
import matplotlib.pyplot as plt
plt.style.use(style='ggplot')
plt.rcParams['figure.figsize'] = (10, 6)
剩余32页未读,继续阅读
luguanyou
- 粉丝: 68
- 资源: 5
上传资源 快速赚钱
- 我的内容管理 收起
- 我的资源 快来上传第一个资源
- 我的收益 登录查看自己的收益
- 我的积分 登录查看自己的积分
- 我的C币 登录后查看C币余额
- 我的收藏
- 我的下载
- 下载帮助
会员权益专享
最新资源
- 中文翻译Introduction to Linear Algebra, 5th Edition 2.1节
- zigbee-cluster-library-specification
- JSBSim Reference Manual
- c++校园超市商品信息管理系统课程设计说明书(含源代码) (2).pdf
- 建筑供配电系统相关课件.pptx
- 企业管理规章制度及管理模式.doc
- vb打开摄像头.doc
- 云计算-可信计算中认证协议改进方案.pdf
- [详细完整版]单片机编程4.ppt
- c语言常用算法.pdf
- c++经典程序代码大全.pdf
- 单片机数字时钟资料.doc
- 11项目管理前沿1.0.pptx
- 基于ssm的“魅力”繁峙宣传网站的设计与实现论文.doc
- 智慧交通综合解决方案.pptx
- 建筑防潮设计-PowerPointPresentati.pptx
资源上传下载、课程学习等过程中有任何疑问或建议,欢迎提出宝贵意见哦~我们会及时处理!
点击此处反馈
安全验证
文档复制为VIP权益,开通VIP直接复制
信息提交成功