Gabor's Data Analysis: stuck on the Chapter 1 code, how to set the directory

6jygbczu  posted 11 months ago in Other

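Dear Stack Overflow community,
I am trying to teach myself R and data analysis with the textbook "Békés, Gábor. Data Analysis for Business, Economics, and Policy" (https://gabors-data-analysis.com/). I am working with the hotels-vienna dataset (https://osf.io/y6jvb) and am stuck on the code below.
I do have a little experience with R, but my foundations are very weak, so I am relearning from scratch and would really appreciate step-by-step guidance on how to work through the code below.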

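**Practice question from the textbook:** Take the hotels-vienna dataset used in this chapter and use your computer to pick samples of size 25, 50, and 200. Calculate the simple average of hotel price in each sample and compare it with the average hotel price in the entire dataset. Repeat the exercise three times and record the results. Comment on how the average varies across samples of different sizes.
Dataset: https://osf.io/y6jvb
Code: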

library(tidyverse)

# set working directory
# option A: open material as project
# option B: set working directory for da_case_studies

#example: setwd("C:/Users/bekes.gabor/Documents/github/da_case_studies/")

#set data dir, load theme and functions

setwd("C:/Users/sha/Desktop/R/intro/data/da_case_studies/")

source("ch00-tech-prep/theme_bg.R")
source("ch00-tech-prep/da_helper_functions.R")

I don't know how to define data_dir or how to get set-data-directory.R (instructions for setting up the computer: https://gabors-data-analysis.com/howto-r/).
Data used:

source("set-data-directory.R") # data_dir must be first defined
data_in <- paste(data_dir,"hotels-vienna","clean/", sep = "/")

use_case_dir <- "ch01-hotels-data-collect/"
data_out <- use_case_dir
output <- paste0(use_case_dir,"output/")
create_output_if_doesnt_exist(output)

# load in clean and tidy data and create workfile

df <-  read.csv(paste0(data_in,"hotels-vienna.csv"))

# or from the website

df <- read_csv("https://osf.io/y6jvb/download")

# First look

df <- df%>%
  select(hotel_id, accommodation_type, country, city, city_actual, neighbourhood, center1label, distance,center2label, distance_alter, stars, rating, rating_count, ratingta, ratingta_count, year, month,weekend, holiday, nnights, price, scarce_room, offer, offer_cat)

summary(df)
glimpse(df)

# export list

df <- subset(df, select = c(hotel_id, accommodation_type, country, city, city_actual, center1label, distance, stars, rating, price)) 
write.csv(df[1:5,], paste0(output, "hotel_listobs.csv"), row.names = F)

The dataset, via dput():

dput(head(df[, c(1:10)]))

structure(list(hotel_id = c(21894L, 21897L, 21901L, 21902L, 21903L, 
21904L), accommodation_type = c("Apartment", "Hotel", "Hotel", 
"Hotel", "Hotel", "Apartment"), country = c("Austria", "Austria", 
"Austria", "Austria", "Austria", "Austria"), city = c("Vienna", 
"Vienna", "Vienna", "Vienna", "Vienna", "Vienna"), city_actual = c("Vienna", 
"Vienna", "Vienna", "Vienna", "Vienna", "Vienna"), center1label = c("City centre", 
"City centre", "City centre", "City centre", "City centre", "City centre"
), distance = c(2.7, 1.7, 1.4, 1.7, 1.2, 0.9), stars = c(4, 4, 
4, 3, 4, 5), rating = c(4.4, 3.9, 3.7, 4, 3.9, 4.8), price = c(81L, 
81L, 85L, 83L, 82L, 229L)), row.names = c(NA, 6L), class = "data.frame")

What I have tried:

setwd("C:/Users/sha03/Desktop/R/intro/data/da_case_studies/")

source("theme_bg.R")
source("da_helper_functions.R")

read.csv('C:/Users/sha03/Desktop/R/intro/data/da_case_studies/hotels-vienna.csv')

summary(df)
glimpse(df)

I can't seem to get the result I am supposed to get, which is shown here: (https://github.com/gabors-data-analysis/da_case_studies/blob/master/ch01-hotels-data-collect/ch01-hotels-data-collect.ipynb)

46scxncf  #1

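To compute the mean for samples of size 25, 50, and 200 and for the whole dataset, you can use sample to index the rows and select the price column. Remember to set a seed when working with random samples so that the results are reproducible.
You can do each one separately: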

set.seed(123)
n25 <- mean(df[sample(seq_len(nrow(df)), 25), "price"])
n50 <- mean(df[sample(seq_len(nrow(df)), 50), "price"])
n200 <- mean(df[sample(seq_len(nrow(df)), 200), "price"])
nall <- mean(df[, "price"])

# > n25
# [1] 139.88
# > n50
# [1] 145.86
# > n200
# [1] 119.655
# > nall
# [1] 131.3668
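
The effect of set.seed can be checked on a small scale. The sketch below uses only the six prices from the dput() output in the question (not the full dataset), and the names prices, draw_a, and draw_b are illustrative. It shows that resetting the seed reproduces the same draw.

```r
# Six prices copied from the dput() output in the question (not the full data)
prices <- c(81L, 81L, 85L, 83L, 82L, 229L)

set.seed(123)
draw_a <- sample(seq_along(prices), 3)  # three row indices, without replacement

set.seed(123)
draw_b <- sample(seq_along(prices), 3)  # same seed, so the same indices

identical(draw_a, draw_b)               # TRUE
mean(prices[draw_a])                    # sample mean for this draw
mean(prices)                            # mean of all six prices
```

Without the second set.seed call, draw_b would generally be a different set of rows, which is why the sample means change from run to run.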

Or all at once:

set.seed(123)
n <- c(25, 50, 200, nrow(df))

setNames(
  vapply(n, \(x) mean(df[sample(seq_len(nrow(df)), x), "price"]), 1),
         paste0("n = ", n))

#   n = 25   n = 50  n = 200  n = 428 
# 139.8800 145.8600 119.6550 131.3668
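
The exercise also asks for a comparison with the full-data mean. One possible way to make that explicit is sketched below. It is only an illustration: the prices are simulated stand-ins rather than the Vienna data, and the names sizes, full_mean, and comparison are made up for the example.

```r
# Simulated prices stand in for df$price; these are NOT the Vienna data
set.seed(1)
prices <- round(runif(428, min = 50, max = 400))

sizes <- c(25, 50, 200)
full_mean <- mean(prices)

# One random sample (without replacement) per size, then its mean
sample_means <- vapply(sizes, function(k) mean(sample(prices, k)), numeric(1))

# Gap between each sample mean and the full-data mean
comparison <- data.frame(
  n           = sizes,
  sample_mean = sample_means,
  difference  = sample_means - full_mean
)
comparison
```

With the real data, prices would be replaced by the price column of df. Larger samples tend to land closer to the full-data mean, though any single draw can deviate.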


If you need to repeat this three times, you can use lapply, which returns a list:

set.seed(123)

lapply(1:3, \(y) setNames(
  vapply(n, \(x) mean(df[sample(seq_len(nrow(df)), x), "price"]), 1),
         paste0("n = ", n)))

# [[1]]
# n = 25   n = 50  n = 200  n = 428 
# 139.8800 145.8600 119.6550 131.3668 
# 
# [[2]]
# n = 25   n = 50  n = 200  n = 428 
# 140.1200 113.6000 132.2300 131.3668 
# 
# [[3]]
# n = 25   n = 50  n = 200  n = 428 
# 132.6800 137.9800 121.4600 131.3668
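
On the directory question itself: according to the comment in the case-study script, set-data-directory.R has to define data_dir first, and data_in is then built from it. The sketch below rests on assumptions: the folder layout and the tiny CSV are created in a temporary directory purely for illustration, whereas on a real machine data_dir would point to wherever the book's data were saved. It also assigns the result of read.csv to df, which the attempt in the question does not do.

```r
# Hypothetical stand-in for the real data folder; replace with your own path,
# e.g. the folder where the book's datasets were downloaded
data_dir <- file.path(tempdir(), "da_data_repo")

# Same construction as in the case-study script
data_in <- paste(data_dir, "hotels-vienna", "clean/", sep = "/")
dir.create(file.path(data_dir, "hotels-vienna", "clean"),
           recursive = TRUE, showWarnings = FALSE)

# Tiny illustrative file (three rows from the dput() output in the question)
write.csv(
  data.frame(hotel_id = c(21894L, 21897L, 21901L), price = c(81L, 81L, 85L)),
  paste0(data_in, "hotels-vienna.csv"),
  row.names = FALSE
)

# Assign the result; without `df <-` the data are only printed, and `df`
# still refers to the base R function stats::df
df <- read.csv(paste0(data_in, "hotels-vienna.csv"))
summary(df$price)
```

Note also that the attempt sources theme_bg.R and da_helper_functions.R without the ch00-tech-prep/ folder used in the original script, so those calls would only work if the files sit directly in the working directory.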
