
We planned to show the correlations between price and zip code, ratings and price, and median income and zip code.

  1. Checking the data for inaccuracies.  

The data came in two files. The first file contained 10710 entries with the following variables: Restaurant id, name, is_claimed, is_closed, phone number, review_count, categories01, categories02, categories03, rating, price, transactions, zip_code, city, address, restaurant_url, image_url, latitude, longitude, photos, cross_streets. The second file contained the Restaurant ID and the review text. After importing the data, we used dim(MyData) to count the observations and variables, which gave 10710 observations and 21 variables. We cleaned and repaired only the data we needed to use.
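The size check can be sketched as follows. This is only an illustration: the small data frame is a made-up stand-in for the restaurant file, and the file name in the comment is hypothetical.

```r
# Toy stand-in for the restaurant file; the real data would come from
# something like read.csv("restaurants.csv") (hypothetical file name).
MyData <- data.frame(
  id       = c("a1", "b2", "c3"),
  rating   = c(4.0, 3.5, 5.0),
  price    = c("$", "$$", NA),
  zip_code = c("90012", NA, "94103"),
  stringsAsFactors = FALSE
)

dims <- dim(MyData)   # c(number of rows, number of columns)
dims
nrow(MyData)          # observations only
ncol(MyData)          # variables only
```

Comparing the first value of dim() with the row count expected from the source file is a quick way to spot a truncated import.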

Zip Codes 

We used which(is.na(MyData$zip_code)) to identify the observations with missing zip codes. We found missing entries in rows 903, 2214, 5292, 5293, 5294, 5295, 5412, and 5991. From there we moved to Excel, looked up each restaurant's Yelp page, and determined its zip code from the page. We also found entries that were clearly errors; for example, entry 90oV7qCGRbyN0mxaySBoUA was actually in Oklahoma, not California. There were other instances of this and similar problems, so we used Excel to filter out the zip codes that were not in the range of California zip codes.
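Both checks (missing zip codes and zip codes outside California) could also be done in R rather than Excel. The sketch below uses made-up rows, and the range 90001 to 96162 is an assumption about California zip codes, not a value taken from the project.

```r
# Toy stand-in data; ids and zip codes are made up for illustration.
MyData <- data.frame(
  id       = c("r1", "r2", "r3", "r4"),
  zip_code = c("90012", NA, "73102", "94103"),
  stringsAsFactors = FALSE
)

# Row numbers whose zip code is missing
missing_rows <- which(is.na(MyData$zip_code))

# Assumed California range (90001 to 96162); adjust if your reference differs
zip_num <- suppressWarnings(as.integer(MyData$zip_code))
in_ca   <- !is.na(zip_num) & zip_num >= 90001 & zip_num <= 96162

outside_ca <- MyData[!is.na(zip_num) & !in_ca, ]  # rows to inspect by hand
ca_only    <- MyData[in_ca, ]                     # rows kept for analysis
```

Keeping outside_ca as a separate table, instead of dropping those rows silently, leaves room to correct a mistyped zip code by hand, as was done for the missing entries.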



We had difficulty finding the mean of the price column: many rows were empty, and price was stored as a symbol (ranging from $ to $$$$) rather than a number, which made the conversion awkward. To fix this problem, we converted each symbol to a number, found the mean while ignoring the missing values, replaced the missing values with that mean, and rounded the result:

> library(dplyr)

> # "$$$$" is not mapped in this call, so recode() turns it into NA; add "$$$$" = 4 to keep it

> mydata.recoded1 <- mydata %>% mutate(pricerecoded = recode(price, "$" = 1, "$$" = 2, "$$$" = 3))

> View(mydata.recoded1)

> priceMean <- mean(mydata.recoded1$pricerecoded, na.rm = TRUE)

> priceMean

[1] 1.578526

> mydata.recoded1$pricerecoded[is.na(mydata.recoded1$pricerecoded)] <- priceMean

> round(mydata.recoded1$pricerecoded, 0)

[1] 2 2 2 1 1 1 2 2 2 2 1 1 1 2 1 2 1 2 2 1 2 1 1 1 1 1 2 2 2 1 2 1 1 2 1 2 2 2 2 1 2
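The same recode-then-impute steps can be sketched in base R on a small example. The price values below are made up, the variable names are hypothetical, and the lookup maps all four tiers, which is an assumption about the intended encoding.

```r
# Toy price column; values are made up for illustration.
price <- c("$", "$$", NA, "$$$$", "", "$$$", "$")

# Named lookup vector covering all four price tiers
tiers <- c("$" = 1, "$$" = 2, "$$$" = 3, "$$$$" = 4)

# Indexing by name gives NA for anything not in the lookup (NA or "")
price_num <- unname(tiers[price])

# Mean of the known prices, ignoring missing values
price_mean <- mean(price_num, na.rm = TRUE)

# Replace missing prices with the mean, then round to a whole tier
price_filled <- price_num
price_filled[is.na(price_filled)] <- price_mean
price_rounded <- round(price_filled, 0)
```

Filling gaps with the rounded mean pulls every missing restaurant toward the most common tiers. Using the median, or leaving the values as NA and passing na.rm = TRUE where needed, are alternatives worth comparing.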

Language: R
Type: Other
Release Date: Oct 10, 2021
Last Updated: Oct 10, 2021
