Please submit your work as either a PDF or a Word file.

**Data Assignment Guidelines**

**Overview**

In each data assignment, you will be asked to answer a series of questions raised in a business context using the statistical techniques you have learned in class. Each data assignment will be accompanied by a data set; the data sets are extracted from different sources. Below is a list of requirements you need to follow for each data assignment:

- Each data assignment may be completed in groups of up to three people. You may also work individually if you prefer.

- Everyone in the group should contribute. All team members are equally responsible for the materials presented in each data assignment, and team members in the same group will receive the same grade for the work submitted. Submit one copy per group.

- Please use RStudio to solve the problems in each data assignment. Please do not use other statistical analysis software packages such as Stata, JMP, SPSS, or Excel.

- For each question, please show the following to earn full credit unless specified otherwise:

1) numerical answers;

2) the R code used to derive each numerical answer;

3) interpretation of results.

- When needed, use a 5% significance level or 95% confidence interval for analyses.

**Sample Data Assignment Question**

- Estimate the mean ROA ratio of publicly traded firms in the United States in the financial year 2020 and interpret your results (include both statistical and economic interpretations). What are the limitations and/or assumptions of your analysis?

1) Numerical answers and R code (10 points)

| Quantity | Value | R code |
|---|---|---|
| Point Estimate | 0.057 | `PE <- mean(Exp$ROA)` |
| Margin of Error | 0.023 | `ME <- qt(0.975, 64) * sd(Exp$ROA) / sqrt(65)` |
| Lower Bound | 0.034 | `PE - ME` |
| Upper Bound | 0.080 | `PE + ME` |
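The interval calculation above can be run end to end as in the following minimal sketch. It assumes a data frame named `Exp` with a numeric `ROA` column, as in the sample answer; because the assignment data are not reproduced here, the values below are simulated stand-ins and will not match the numbers in the sample answer.

```r
# Simulated stand-in for the ROA column (assumption: the real data are not shown here).
set.seed(1)
Exp <- data.frame(ROA = rnorm(65, mean = 0.06, sd = 0.09))

n  <- nrow(Exp)                                      # sample size (65 observations)
PE <- mean(Exp$ROA)                                  # point estimate of the mean ROA
ME <- qt(0.975, df = n - 1) * sd(Exp$ROA) / sqrt(n)  # margin of error at 95% confidence
ci <- c(lower = PE - ME, upper = PE + ME)            # interval bounds

# Cross-check against R's built-in one-sample t interval.
ci_builtin <- t.test(Exp$ROA, conf.level = 0.95)$conf.int

print(round(ci, 3))
print(round(as.numeric(ci_builtin), 3))
```

Computing `n` from the data rather than typing 64 and 65 by hand keeps the degrees of freedom correct if the sample changes.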

2) Interpretation (10 points)

We are 95% confident that the mean ROA ratio of publicly traded firms in the United States in the financial year 2020 is between 0.034 and 0.080.

According to Forbes Advisor, “A ROA of 5% or better is typically considered a good ratio while 20% or better is considered great” (Birken & Curry, 2020). Our results therefore suggest that publicly traded firms showed only acceptable profitability in the financial year 2020. (You may add more explanation here, e.g., the pandemic.)

3) Limitation (5 points)

Because of limited data availability, there are only 65 observations in the sample data set. Since the sample size is at least 30, the central limit theorem justifies treating the sampling distribution of the sample mean as approximately normal, but a larger sample would yield a more precise estimate.

Assume you are a statistician working at the U.S. Bureau of Labor Statistics. Because of the COVID-19 pandemic, you have been asked to prepare a report on the influence of the pandemic on students’ academic performance. You have also been asked to analyze several factors (e.g., education level and marital status) that may affect marijuana use among undergraduate and graduate students.

To prepare the report requested by the U.S. Bureau of Labor Statistics, you have been provided with a data set extracted from the National Longitudinal Survey of Youth 97 (NLSY97). Survey data were collected from 2019 to 2020. The data set contains the following variables, which reflect the demographic information and academic performance of respondents.

1) PUBID: assigned ID of each respondent

2) HIGHEST_DEGREE: 1=bachelor’s or above; 0=below bachelor’s

3) MARITAL_STATUS: 1=married; 0=unmarried

4) MARIJUANA_USE: 1=had used marijuana; 0=had not used marijuana

5) GPA_2019: GPA for the 2019 semesters

6) GPA_2020: GPA for the 2020 semesters

7) WEEKLY_D: the average number of drinks per week
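Before running any analysis, it can help to confirm that the data match the coding described in the variable list. The sketch below builds a small simulated data frame with the same column names and runs a few basic checks. The data frame name `nlsy`, the simulated values, and the 0 to 4 GPA scale are all assumptions for illustration; with the real data you would load your own file instead of simulating.

```r
# Hypothetical toy data with the same column names as the variable list above.
# With the real data, replace this block with your own import step.
set.seed(97)
n <- 200
nlsy <- data.frame(
  PUBID          = seq_len(n),
  HIGHEST_DEGREE = rbinom(n, 1, 0.4),
  MARITAL_STATUS = rbinom(n, 1, 0.3),
  MARIJUANA_USE  = rbinom(n, 1, 0.5),
  GPA_2019       = round(pmin(pmax(rnorm(n, 3.0, 0.5), 0), 4), 2),  # assumed 0-4 scale
  GPA_2020       = round(pmin(pmax(rnorm(n, 3.0, 0.6), 0), 4), 2),  # assumed 0-4 scale
  WEEKLY_D       = rpois(n, 3)
)

# Check that the indicator variables contain only 0 and 1.
binary_cols <- c("HIGHEST_DEGREE", "MARITAL_STATUS", "MARIJUANA_USE")
binary_ok <- sapply(nlsy[binary_cols], function(x) all(x %in% c(0, 1)))

# Check that respondent IDs are unique and count missing values per column.
ids_unique <- anyDuplicated(nlsy$PUBID) == 0
n_missing  <- colSums(is.na(nlsy))

print(binary_ok)
print(ids_unique)
print(n_missing)
summary(nlsy[c("GPA_2019", "GPA_2020", "WEEKLY_D")])
```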

- To analyze the potential influence of the pandemic on students’ academic performance, you decide to compare the consistency of students’ GPA in 2019 (before the pandemic) with that in 2020 (during the pandemic). Construct a 95% confidence interval for the variance of GPA in each of those two years and interpret your results (only statistical interpretation is needed). What conclusion can you draw based on the two confidence intervals you constructed? What assumption did you make when constructing those confidence intervals, and how do you justify your assumption? [Chapter 11]
- A study has shown that students’ drinking patterns were less consistent during 2020. The study claims that the variance of the number of drinks a student had per week has increased to more than 5. Set up the competing hypotheses and test the claim. Interpret your results (only statistical interpretation is needed). [Chapter 11]
- Substance use among college-aged youths is a major concern in public health discussions. Studies have shown that married individuals are less likely to use marijuana than unmarried individuals. You would like to test whether marijuana use varies with marital status (i.e., the proportion of married individuals who used marijuana = the proportion of unmarried individuals who used marijuana = 0.50). Set up the competing hypotheses, construct a table that includes the hypothesized proportions and observed frequencies, and conduct the corresponding test. Interpret your results (include both statistical and economic interpretations). What assumptions did you make when conducting the test? [Chapter 12]

[Hint: For the economic interpretation, you may use any resources you want to help explain your results and make recommendations based on them.]
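For the two Chapter 11 questions on variance above, the chi-square procedures can be sketched as follows. This is a minimal illustration under stated assumptions, not a worked solution: the helper names `var_ci` and `var_test_right` are hypothetical, the data are simulated, and both procedures assume the underlying population is normally distributed.

```r
# Hypothetical helper: chi-square confidence interval for a population variance.
var_ci <- function(x, conf_level = 0.95) {
  x     <- x[!is.na(x)]
  df    <- length(x) - 1
  alpha <- 1 - conf_level
  s2    <- var(x)                                   # sample variance
  c(lower = df * s2 / qchisq(1 - alpha / 2, df),
    upper = df * s2 / qchisq(alpha / 2, df))
}

# Hypothetical helper: right-tailed test of
# H0: sigma^2 <= sigma0_sq  versus  HA: sigma^2 > sigma0_sq.
var_test_right <- function(x, sigma0_sq) {
  x    <- x[!is.na(x)]
  df   <- length(x) - 1
  stat <- df * var(x) / sigma0_sq                   # chi-square test statistic
  list(statistic = stat, df = df,
       p_value = pchisq(stat, df, lower.tail = FALSE))
}

# Simulated stand-ins for GPA_2019, GPA_2020, and WEEKLY_D (not the real data).
set.seed(11)
gpa_2019 <- rnorm(100, mean = 3.0, sd = 0.5)
gpa_2020 <- rnorm(100, mean = 3.0, sd = 0.7)
weekly_d <- rnorm(100, mean = 4.0, sd = 2.5)

ci_2019     <- var_ci(gpa_2019)
ci_2020     <- var_ci(gpa_2020)
drinks_test <- var_test_right(weekly_d, sigma0_sq = 5)
reject_h0   <- drinks_test$p_value < 0.05           # decision at the 5% level

print(ci_2019)
print(ci_2020)
print(drinks_test)
```

With the real data, the simulated vectors would be replaced by the corresponding columns of your data frame. A histogram or normal Q-Q plot of each variable is one way to examine the normality assumption.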

- To study marijuana use among college-aged youths, you would like to test whether marijuana use is independent of education level, since education is often characterized as a preventive factor for substance use. Set up the competing hypotheses, build the contingency table, and conduct the corresponding test. Interpret your results (include both statistical and economic interpretations). Does the sample meet the minimum requirement for testing for independence? [Chapter 12]

[Hint: For the economic interpretation, you may use any resources you want to help explain your results and make recommendations based on them.]
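The two Chapter 12 questions above use chi-square tests on counts, which can be sketched as follows. The data frame `nlsy` and its values are simulated for illustration only. The goodness-of-fit part reflects one possible reading of the marital status question (an even married/unmarried split among respondents who used marijuana), which is an assumption rather than the required setup.

```r
# Simulated stand-in data using the assignment's column names (not the real data).
set.seed(12)
n <- 400
nlsy <- data.frame(
  HIGHEST_DEGREE = rbinom(n, 1, 0.4),
  MARITAL_STATUS = rbinom(n, 1, 0.4),
  MARIJUANA_USE  = rbinom(n, 1, 0.5)
)

# Goodness-of-fit test: observed frequencies versus hypothesized proportions of 0.50 each.
users    <- subset(nlsy, MARIJUANA_USE == 1)
observed <- table(factor(users$MARITAL_STATUS, levels = c(0, 1),
                         labels = c("unmarried", "married")))
gof <- chisq.test(observed, p = c(0.5, 0.5))
gof_table <- data.frame(
  group        = names(observed),
  hypothesized = c(0.5, 0.5),
  observed     = as.vector(observed),
  expected     = as.vector(gof$expected)
)

# Test of independence between education level and marijuana use.
# correct = FALSE turns off the continuity correction that chisq.test applies
# to 2x2 tables by default, so the statistic matches the uncorrected formula.
tab   <- table(degree = nlsy$HIGHEST_DEGREE, marijuana = nlsy$MARIJUANA_USE)
indep <- chisq.test(tab, correct = FALSE)

# Common rule of thumb: every expected frequency should be at least 5.
min_expected_ok <- all(indep$expected >= 5)

print(gof_table)
print(gof)
print(tab)
print(indep)
print(min_expected_ok)
```

Inspecting `indep$expected` directly is a simple way to answer the minimum-requirement part of the independence question.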