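Dear Stack Overflow community,
I am trying to teach myself R and data analysis using the textbook "Békés, Gábor. Data Analysis for Business, Economics, and Policy" (https://gabors-data-analysis.com/). I am working with the hotels-vienna dataset (https://osf.io/y6jvb) and am stuck on the code below.
I do have a little experience with R, but my foundations are very weak, so I am re-learning from the beginning. I would really appreciate step-by-step guidance on how to work through the code below.
**Practice question from the textbook:** Take the hotels-vienna dataset used in this chapter and use your computer to pick samples of size 25, 50, and 200. Calculate the simple average of hotel price in each sample and compare it to the average hotel price in the entire dataset. Repeat the exercise three times and record the results. Comment on how the average varies across the different samples.
Dataset: https://osf.io/y6jvb
Code: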
library(tidyverse)
# set working directory
# option A: open material as project
# option B: set working directory for da_case_studies
#example: setwd("C:/Users/bekes.gabor/Documents/github/da_case_studies/")
#set data dir, load theme and functions
setwd("C:/Users/sha/Desktop/R/intro/data/da_case_studies/")
source("ch00-tech-prep/theme_bg.R")
source("ch00-tech-prep/da_helper_functions.R")
I don't know how to define data_dir or where to get set-data-directory.R (the setup instructions are at https://gabors-data-analysis.com/howto-r/).
Data used:
source("set-data-directory.R") # data_dir must be defined first
data_in <- paste(data_dir, "hotels-vienna", "clean/", sep = "/")
use_case_dir <- "ch01-hotels-data-collect/"
data_out <- use_case_dir
output <- paste0(use_case_dir, "output/")
create_output_if_doesnt_exist(output)
# load in clean and tidy data and create workfile
df <- read.csv(paste0(data_in,"hotels-vienna.csv"))
# or from the website
df <- read_csv("https://osf.io/y6jvb/download")
# First look
df <- df %>%
  select(hotel_id, accommodation_type, country, city, city_actual,
         neighbourhood, center1label, distance, center2label, distance_alter,
         stars, rating, rating_count, ratingta, ratingta_count, year, month,
         weekend, holiday, nnights, price, scarce_room, offer, offer_cat)
summary(df)
glimpse(df)
# export list
df <- subset(df, select = c(hotel_id, accommodation_type, country, city, city_actual, center1label, distance, stars, rating, price))
write.csv(df[1:5,], paste0(output, "hotel_listobs.csv"), row.names = F)
Sample of the dataset from dput():
dput(head(df[, c(1:10)]))
structure(list(hotel_id = c(21894L, 21897L, 21901L, 21902L, 21903L,
21904L), accommodation_type = c("Apartment", "Hotel", "Hotel",
"Hotel", "Hotel", "Apartment"), country = c("Austria", "Austria",
"Austria", "Austria", "Austria", "Austria"), city = c("Vienna",
"Vienna", "Vienna", "Vienna", "Vienna", "Vienna"), city_actual = c("Vienna",
"Vienna", "Vienna", "Vienna", "Vienna", "Vienna"), center1label = c("City centre",
"City centre", "City centre", "City centre", "City centre", "City centre"
), distance = c(2.7, 1.7, 1.4, 1.7, 1.2, 0.9), stars = c(4, 4,
4, 3, 4, 5), rating = c(4.4, 3.9, 3.7, 4, 3.9, 4.8), price = c(81L,
81L, 85L, 83L, 82L, 229L)), row.names = c(NA, 6L), class = "data.frame")
What I tried:
setwd("C:/Users/sha03/Desktop/R/intro/data/da_case_studies/")
source("theme_bg.R")
source("da_helper_functions.R")
read.csv('C:/Users/sha03/Desktop/R/intro/data/da_case_studies/hotels-vienna.csv')
summary(df)
glimpse(df)
I can't seem to get the result I am supposed to get, which is shown here: https://github.com/gabors-data-analysis/da_case_studies/blob/master/ch01-hotels-data-collect/ch01-hotels-data-collect.ipynb
1 Answer
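To compute the mean for samples of size 25, 50, and 200 and for the whole dataset, you can use sample to index the rows and then select the price column. Remember to set a seed when working with random samples so that the results are reproducible. You can do each one separately: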
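A minimal sketch of the one-at-a-time approach, assuming the data is read straight from the download URL in the question (this needs internet access). The seed value 123 and the names m25, m50, m200, and m_all are arbitrary choices for illustration.

```r
# Load the data from the URL used in the question (assumes internet access).
df <- read.csv("https://osf.io/y6jvb/download")

# Arbitrary seed, set only so the random draws can be reproduced.
set.seed(123)

# sample(nrow(df), n) draws n distinct row numbers without replacement;
# those row numbers are then used to index the price column.
m25  <- mean(df$price[sample(nrow(df), 25)])
m50  <- mean(df$price[sample(nrow(df), 50)])
m200 <- mean(df$price[sample(nrow(df), 200)])

# Benchmark: mean price over the entire dataset.
m_all <- mean(df$price)

print(c(n25 = m25, n50 = m50, n200 = m200, all = m_all))
```

Each call to sample draws a fresh set of rows, so the three sample means will generally differ from each other and from the full-data mean.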
Or all at once:
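One way the all-at-once version could look, as a sketch: loop over the sample sizes with sapply and collect the means in a named vector. The names sizes and sample_means and the seed value are assumptions made for this example, and the data is again read from the URL in the question.

```r
# Load the data from the URL used in the question (assumes internet access).
df <- read.csv("https://osf.io/y6jvb/download")

# Arbitrary seed, set only so the random draws can be reproduced.
set.seed(123)

sizes <- c(25, 50, 200)

# For each sample size n, draw n rows at random and average their prices.
sample_means <- sapply(sizes, function(n) mean(df$price[sample(nrow(df), n)]))
names(sample_means) <- sizes

# Append the full-data mean so everything can be compared side by side.
result <- c(sample_means, all = mean(df$price))
print(result)
```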
If you need to repeat this three times, you can use lapply, which returns a list:
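A sketch of the repeated version, assuming three repetitions as the exercise asks. The names sizes, repeats, and repeats_table, the seed value, and the final rbind step are illustrative additions, and the data is read from the URL in the question.

```r
# Load the data from the URL used in the question (assumes internet access).
df <- read.csv("https://osf.io/y6jvb/download")

# Arbitrary seed, set once so the whole sequence of draws is reproducible.
set.seed(123)

sizes <- c(25, 50, 200)

# Repeat the exercise three times; each repetition draws new samples
# and returns one mean per sample size.
repeats <- lapply(1:3, function(i) {
  means <- sapply(sizes, function(n) mean(df$price[sample(nrow(df), n)]))
  names(means) <- sizes
  means
})
print(repeats)

# Optional: stack the list into a 3 x 3 matrix (rows = repetitions,
# columns = sample sizes) to make the comparison easier to read.
repeats_table <- do.call(rbind, repeats)
print(repeats_table)
print(mean(df$price))  # full-data mean for reference
```

Setting the seed once before lapply, rather than inside the function, keeps the three repetitions different from one another while still making the whole run reproducible.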