Demographic Impact on Investment

Author

Collin Real

Published

July 26, 2024

SRI LAKSHMIR CHUNDRU contributed to this exercise.

1 Summary/Abstract

The objective of this analysis is to identify patterns in the relationship between customer demographics and customers’ financial situation.

2 Introduction

2.1 General Background Information

Customer demographics such as religion, gender, height, and weight were analyzed to identify which of them had the strongest influence on an individual’s investments.

2.2 Description of data and data source

The two variables I added to the data sheet are Religion and Investments. The Religion variable seeks to identify the spiritual beliefs, if any, of each individual. The Investments variable represents the total fair market value of an individual’s investment accounts.

2.3 Questions/Hypotheses to be addressed

Can individuals be clustered into groups based on their demographics and the fair market value of their investment accounts? If so, how can these clusters be used to gain a competitive advantage?

3 Methods

We used R to import, clean, and analyze the data.

3.1 Data acquisition

Dummy data was used for this report.

3.2 Data import and cleaning

3.2.0.1 Import libraries.

library(readxl) #for loading Excel files
library(dplyr) #for data processing/cleaning


Attaching package: 'dplyr'

The following objects are masked from 'package:stats':

    filter, lag

The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union

library(tidyr) #for data processing/cleaning
library(skimr) #for nice visualization of data 
library(here) #to set paths

3.2.0.2 Import Data

data_location <- here::here("starter-analysis-exercise","data","raw-data","exampledata2.xlsx")
rawdata <- readxl::read_excel(data_location)
print(rawdata)

# A tibble: 14 × 5
   Height Weight Gender Religion     Investments
   <chr>   <dbl> <chr>  <chr>              <dbl>
 1 180        80 M      Christianity       11000
 2 175        70 O      Agnostic            5000
 3 sixty      60 F      Islam               6500
 4 178        76 F      Judaism             9000
 5 192        90 NA     Atheist            12000
 6 6          55 F      Christianity        3000
 7 156        90 O      Buddhism            4200
 8 166       110 M      Hinduism            6600
 9 155        54 N      Atheist             1200
10 145      7000 M      Agnostic            8700
11 165        NA F      Islam              12400
12 133        45 F      Mormonism           2600
13 166        55 M      Mormonism           9900
14 154        50 M      Buddhism            7300

3.2.0.3 Check Data

codebook <- readxl::read_excel(data_location, sheet ="Codebook")
print(codebook)

# A tibble: 5 × 3
  `Variable Name` `Variable Definition`                 `Allowed Values`        
  <chr>           <chr>                                 <chr>                   
1 Height          height in centimeters                 numeric value >0 or NA  
2 Weight          weight in kilograms                   numeric value >0 or NA  
3 Gender          identified gender (male/female/other) M/F/O/NA                
4 Religion        spiritual beliefs                     Christianity, Islam, Ju…
5 Investments     fmv of investments in dollars         numeric value, or NA

3.2.0.4 Descriptive Statistics

dplyr::glimpse(rawdata)

Rows: 14
Columns: 5
$ Height      <chr> "180", "175", "sixty", "178", "192", "6", "156", "166", "1…
$ Weight      <dbl> 80, 70, 60, 76, 90, 55, 90, 110, 54, 7000, NA, 45, 55, 50
$ Gender      <chr> "M", "O", "F", "F", "NA", "F", "O", "M", "N", "M", "F", "F…
$ Religion    <chr> "Christianity", "Agnostic", "Islam", "Judaism", "Atheist",…
$ Investments <dbl> 11000, 5000, 6500, 9000, 12000, 3000, 4200, 6600, 1200, 87…

summary(rawdata)

    Height              Weight          Gender            Religion        
 Length:14          Min.   :  45.0   Length:14          Length:14         
 Class :character   1st Qu.:  55.0   Class :character   Class :character  
 Mode  :character   Median :  70.0   Mode  :character   Mode  :character  
                    Mean   : 602.7                                        
                    3rd Qu.:  90.0                                        
                    Max.   :7000.0                                        
                    NA's   :1                                             
  Investments   
 Min.   : 1200  
 1st Qu.: 4400  
 Median : 6950  
 Mean   : 7100  
 3rd Qu.: 9675  
 Max.   :12400

head(rawdata)

# A tibble: 6 × 5
  Height Weight Gender Religion     Investments
  <chr>   <dbl> <chr>  <chr>              <dbl>
1 180        80 M      Christianity       11000
2 175        70 O      Agnostic            5000
3 sixty      60 F      Islam               6500
4 178        76 F      Judaism             9000
5 192        90 NA     Atheist            12000
6 6          55 F      Christianity        3000

skimr::skim(rawdata)

Data summary
Name	rawdata
Number of rows	14
Number of columns	5
_______________________
Column type frequency:
character	3
numeric	2
________________________
Group variables	None

Variable type: character

skim_variable	complete_rate	min	max	n_unique
Height	1	1	5	13
Gender	1	1	2	5
Religion	1	5	12	8

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
Weight	1	0.93	602.69	1922.25	45	55	70	90	7000	▇▁▁▁▁
Investments	0	1.00	7100.00	3580.50	1200	4400	6950	9675	12400	▇▅▇▇▇
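The summaries above already hint at the problem rows: Height is stored as text, Weight has an implausible maximum, and Gender has more levels than the codebook allows. As a minimal sketch (not part of the original scripts), the base R snippet below flags those rows explicitly. The data frame re-enters three columns from the printed tibble, and the plausibility cutoffs of 100 cm and 500 kg are hypothetical values chosen only for illustration.

```r
# Re-entered from the printed raw data (three columns only).
raw <- data.frame(
  Height = c("180", "175", "sixty", "178", "192", "6", "156",
             "166", "155", "145", "165", "133", "166", "154"),
  Weight = c(80, 70, 60, 76, 90, 55, 90, 110, 54, 7000, NA, 45, 55, 50),
  Gender = c("M", "O", "F", "F", "NA", "F", "O", "M", "N", "M", "F", "F", "M", "M"),
  stringsAsFactors = FALSE
)

# Non-numeric heights become NA; the warning is expected, so it is suppressed.
height_num <- suppressWarnings(as.numeric(raw$Height))

# Hypothetical plausibility cutoffs (assumptions, not taken from the codebook).
bad_height <- is.na(height_num) | height_num < 100
bad_weight <- is.na(raw$Weight) | raw$Weight > 500

# Anything other than M/F/O is flagged, including the literal string "NA".
bad_gender <- !(raw$Gender %in% c("M", "F", "O"))

print(which(bad_height))  # rows with unusable heights
print(which(bad_weight))  # rows with missing or implausible weights
print(which(bad_gender))  # rows with unrecognized gender codes
```

Flagging rows before removing them makes it easier to decide case by case whether a value can be repaired (such as a height recorded in different units) or has to be dropped.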

3.2.0.5 Data Cleaning

d1 <- rawdata %>% dplyr::filter( Height != "sixty" ) %>% 
                  dplyr::mutate(Height = as.numeric(Height))

d2 <- d1 %>% dplyr::mutate( Height = replace(Height, Height == 6, round(6*30.48,0))) #Height is numeric here; convert the entry of 6 (treated as feet) to cm

d3 <- d2 %>%  dplyr::filter(Weight != 7000) %>% tidyr::drop_na()
d3$Gender <- as.factor(d3$Gender)
d4 <- d3 %>% dplyr::filter( !(Gender %in% c("NA","N")) ) %>% droplevels()  
processeddata <- d4
processeddata

# A tibble: 9 × 5
  Height Weight Gender Religion     Investments
   <dbl>  <dbl> <fct>  <chr>              <dbl>
1    180     80 M      Christianity       11000
2    175     70 O      Agnostic            5000
3    178     76 F      Judaism             9000
4    183     55 F      Christianity        3000
5    156     90 O      Buddhism            4200
6    166    110 M      Hinduism            6600
7    133     45 F      Mormonism           2600
8    166     55 M      Mormonism           9900
9    154     50 M      Buddhism            7300
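To make the row accounting explicit, the following base R sketch mirrors the cleaning steps above and records how many rows survive each one. It is an illustration rather than the original pipeline: it re-enters three columns of the raw data, and the step names are hypothetical. It also makes visible that the weight filter removes the row with a missing weight, since a comparison with NA is not TRUE.

```r
# Re-entered from the printed raw data (three columns only).
raw <- data.frame(
  Height = c("180", "175", "sixty", "178", "192", "6", "156",
             "166", "155", "145", "165", "133", "166", "154"),
  Weight = c(80, 70, 60, 76, 90, 55, 90, 110, 54, 7000, NA, 45, 55, 50),
  Gender = c("M", "O", "F", "F", "NA", "F", "O", "M", "N", "M", "F", "F", "M", "M"),
  stringsAsFactors = FALSE
)

# Step 1: drop the unparseable height, then convert the column to numeric.
step1 <- raw[raw$Height != "sixty", ]
step1$Height <- as.numeric(step1$Height)

# Step 2: treat the height of 6 as feet and convert to centimeters (6 * 30.48 = 182.88).
step1$Height[step1$Height == 6] <- round(6 * 30.48)

# Step 3: drop the implausible weight and the missing weight.
step2 <- step1[!is.na(step1$Weight) & step1$Weight != 7000, ]

# Step 4: drop gender codes that are not M/F/O.
step3 <- step2[!(step2$Gender %in% c("NA", "N")), ]

row_counts <- c(raw = nrow(raw), height = nrow(step1),
                weight = nrow(step2), gender = nrow(step3))
print(row_counts)  # 14, 13, 11, 9
```

Only 9 of the 14 original rows remain, which is worth keeping in mind when reading the model results.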

3.3 Statistical analysis

Two linear models were fit with height as the dependent variable: one with weight and gender as predictors, and one with weight and religion as predictors.
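The model-fitting code itself is not shown in this report. As a minimal sketch, assuming the fits reported in the Results tables were ordinary least squares models, they could be produced with base R's lm() as below. The data frame re-enters the cleaned data printed above; the object names fit_gender and fit_religion are hypothetical.

```r
# Re-entered from the printed processed data.
processed <- data.frame(
  Height = c(180, 175, 178, 183, 156, 166, 133, 166, 154),
  Weight = c(80, 70, 76, 55, 90, 110, 45, 55, 50),
  Gender = factor(c("M", "O", "F", "F", "O", "M", "F", "M", "M")),
  Religion = c("Christianity", "Agnostic", "Judaism", "Christianity", "Buddhism",
               "Hinduism", "Mormonism", "Mormonism", "Buddhism"),
  Investments = c(11000, 5000, 9000, 3000, 4200, 6600, 2600, 9900, 7300)
)

# Height as the response, with weight plus one categorical predictor per model.
fit_gender   <- lm(Height ~ Weight + Gender, data = processed)
fit_religion <- lm(Height ~ Weight + Religion, data = processed)

# Coefficient tables: estimate, standard error, t statistic, p-value.
print(round(summary(fit_gender)$coefficients, 4))
print(round(summary(fit_religion)$coefficients, 4))

# Residual degrees of freedom show how little data is left per parameter.
print(c(gender = df.residual(fit_gender), religion = df.residual(fit_religion)))
```

With nine observations, the religion model estimates seven coefficients and leaves only two residual degrees of freedom, so its standard errors and p-values should be read with caution.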

4 Results

4.1 Exploratory/Descriptive analysis

Table 1 shows a summary of the data.

Table 1: Data summary table.
skim_type	skim_variable	complete_rate	factor.ordered	factor.n_unique	factor.top_counts	numeric.mean	numeric.sd	numeric.p0	numeric.p25	numeric.p50	numeric.p75	numeric.p100	numeric.hist
factor	Gender	1	FALSE	3	M: 4, F: 3, O: 2	NA	NA	NA	NA	NA	NA	NA	NA
numeric	Height	1	NA	NA	NA	165.66667	15.97655	133	156	166	178	183	▂▁▃▃▇
numeric	Weight	1	NA	NA	NA	70.11111	21.24526	45	55	70	80	110	▇▂▃▂▂

4.2 Basic statistical analysis

Figure 1 shows a scatterplot figure produced by one of the R scripts.

Figure 1: Height and weight stratified by gender.

4.3 Full analysis

Table 2 shows a summary of the linear model fit with weight and gender as predictors of height.

Table 2: Linear model fit table.
term	estimate	std.error	statistic	p.value
(Intercept)	149.2726967	23.3823360	6.3839942	0.0013962
Weight	0.2623972	0.3512436	0.7470519	0.4886517
GenderM	-2.1244913	15.5488953	-0.1366329	0.8966520
GenderO	-4.7644739	19.0114155	-0.2506112	0.8120871

4.3.1 Distribution of Height by Religion

Figure 2: Distribution of height by religion.

4.3.2 Correlation Between Investments and Weight

Figure 3: The relationship between investments and weight.

4.3.3 Weight/Religious Influence on Height

Table 3: Linear model fit table.
term	estimate	std.error	statistic	p.value
(Intercept)	164.913979	37.0650265	4.4493150	0.0469828
Weight	0.144086	0.4760345	0.3026798	0.7907129
ReligionBuddhism	-20.000000	19.8783599	-1.0061192	0.4203016
ReligionChristianity	6.860215	19.9139524	0.3444929	0.7633274
ReligionHinduism	-14.763441	29.8234767	-0.4950275	0.6696182
ReligionJudaism	2.135484	23.1305752	0.0923230	0.9348565
ReligionMormonism	-22.618280	22.0407063	-1.0262048	0.4126949

5 Discussion

5.1 Summary and Interpretation

Two diagnostic plots were created to visualize the data. The box plot shows the distribution of height within each level of the categorical variable Religion and helps identify potential outliers. Our box plots show little variation in height for all religions except Mormonism. The scatter plot illustrates the relationship between the Weight and Investments variables (i.e., their degree of correlation) and does not show much correlation between the two.
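The visual impressions from the two plots can also be checked numerically. The sketch below is an illustration added for this purpose, not part of the original scripts: it re-enters the relevant columns of the cleaned data, computes the height range within each religion, and computes the Pearson correlation between Weight and Investments. The object names are hypothetical.

```r
# Re-entered from the printed processed data.
processed <- data.frame(
  Height = c(180, 175, 178, 183, 156, 166, 133, 166, 154),
  Weight = c(80, 70, 76, 55, 90, 110, 45, 55, 50),
  Religion = c("Christianity", "Agnostic", "Judaism", "Christianity", "Buddhism",
               "Hinduism", "Mormonism", "Mormonism", "Buddhism"),
  Investments = c(11000, 5000, 9000, 3000, 4200, 6600, 2600, 9900, 7300)
)

# Height range (max minus min) within each religion; single-row groups give 0.
height_spread <- tapply(processed$Height, processed$Religion,
                        function(h) diff(range(h)))
print(height_spread)

# Pearson correlation between weight and investment value.
weight_invest_cor <- cor(processed$Weight, processed$Investments)
print(weight_invest_cor)
```

Several religions are represented by a single person, so a spread of zero there reflects the sample size rather than any real homogeneity.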

5.2 Strengths and Limitations

One major limitation is the small sample size; it is difficult to identify patterns with so little data. Another limitation is that no additional plots were created for other combinations of variables. One strength is the level of demographic detail, which is essential for market segmentation.

5.3 Conclusions

One of the key takeaways from this project is the importance of data acquisition and of validating its integrity. Small sample sizes and junk data limit the opportunity to produce valuable insights. Data is the most critical asset for predictive modeling. If the integrity of a dataset is in question, so are the results of every model built on top of it.

This paper (Leek & Peng, 2015) discusses types of analyses.

These papers (McKay, Ebell, Billings, et al., 2020; McKay, Ebell, Dale, Shen, & Handel, 2020) are good examples of papers published using a fully reproducible setup similar to the one shown in this template.

Note that cited references appear at the end of the document; the reference formatting is determined by the CSL file specified in the YAML header. Many more style files for almost any journal are available. The location of the BibTeX reference file is also specified in the YAML. The reference file can have any name; the generic name references.bib is used here, but a more descriptive name is probably better.

6 References

Leek, J. T., & Peng, R. D. (2015). Statistics. What is the question? Science (New York, N.Y.), 347(6228), 1314–1315. https://doi.org/10.1126/science.aaa6146

McKay, B., Ebell, M., Billings, W. Z., Dale, A. P., Shen, Y., & Handel, A. (2020). Associations Between Relative Viral Load at Diagnosis and Influenza A Symptoms and Recovery. Open Forum Infectious Diseases, 7(11), ofaa494. https://doi.org/10.1093/ofid/ofaa494

McKay, B., Ebell, M., Dale, A. P., Shen, Y., & Handel, A. (2020). Virulence-mediated infectiousness and activity trade-offs and their impact on transmission potential of influenza patients. Proceedings. Biological Sciences, 287(1927), 20200496. https://doi.org/10.1098/rspb.2020.0496