*****************
*** LECTURE 1 ***
*****************

*****************
* 1.	PRELIMINARY OPERATIONS
* lines preceded by * (colored green) are not commands, just comments
*(the path "C:\Desktop\lezione_1" depends on the folder where you save the lesson materials on your computer)
cd "C:\Desktop\lezione_1"

cap log close
log using "lecture1.log", replace

*****************
* 2.	EXPLORATORY ANALYSIS OF THE DATABASE RPER.DTA
* load the rper.dta dataset into Stata's memory
use rper.dta, clear

* basic commands to describe and look at the data
des
sum
browse
edit

*****************
* 3.	SELECTION OF VARIABLES AND OBSERVATIONS OF INTEREST, RENAMING AND LABELLING OF VARIABLES

* this is a large dataset, but we are interested in only some variables and some observations

* I want to keep only a subset of variables in my data, since I do not need the others
keep nquest nord anno yl yt ytp

* I can also drop a variable (or list of variables)
drop yt

* now I can check again how the dataset has been changed
des

* we can make the data more understandable by using labels or changing variable names
rename nquest id_household
rename nord member_household

label var id_household "household identifier"
label var member_household "identifier of household member"
label var yl "income from employment"
label var ytp "income from pension"

* now I can check again how the dataset has been changed
des

*there are many years in the database
ta anno

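* if I want to keep only observations belonging to the years 2014, 2016 and 2020 I can run the following
* (inlist() keeps exactly the listed years; inrange(anno, 2014, 2020) would also keep any year in between)
keep if inlist(anno, 2014, 2016, 2020)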
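* A quick check, offered as a sketch: count the observations whose year is not one of the three intended years
* (a result of 0 means that only 2014, 2016 and 2020 remain in memory)
count if !inlist(anno, 2014, 2016, 2020)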

* I can also drop some observations. For example, to drop the year 2020
drop if anno==2020

* check how many years we have now
ta anno


*****************
* 4.	HYPOTHESIS TESTING

* is labor income higher in 2014 or in 2016?
tab anno, sum(yl)

* are we missing something? these are nominal wages; to compare real wages I need to adjust for inflation
* on the "ISTAT rivaluta" website I can check that 1 eur. in 2014 is worth 0.998 eur. of 2016
* let's adjust wages of 2014 to their 2016 value

* I can generate new variables in Stata, for example the consumer price index
gen cpi=1 if anno==2016
replace cpi=0.998 if anno==2014

gen yl_real=yl/cpi
gen ytp_real=ytp/cpi
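* A sketch of a sanity check: cpi is defined only for 2014 and 2016, so any other year still in memory
* would have a missing cpi and therefore missing real incomes; the count below should be 0
count if missing(cpi)
tabstat cpi, by(anno) s(mean n)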

* now let's check whether real labor income was higher in 2016 than 2014
tabstat yl_real, by(anno) s(mean sd min max n)

* notice that we can do similar things using tab or tabstat. In Stata the same analysis can often be performed in several ways. What matters is that the final result is correct...

* notice that there are many workers with 0 labor earnings. We can assume that they are not working
* I want to compute earnings for individuals that work
tabstat yl_real if yl_real>0, by(anno) s(mean sd min max n)

* TO FORMALLY TEST THE SIGNIFICANCE OF THE DIFFERENCE IN AVERAGE EARNINGS ACROSS YEARS
ttest yl_real if yl_real>0, by(anno)
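* By default ttest assumes equal variances in the two groups. As a possible robustness check
* (a sketch, not part of the original analysis), the unequal option relaxes that assumption
ttest yl_real if yl_real>0, by(anno) unequal
* ttest leaves its results in r(); assuming only 2014 and 2016 are in memory, group 1 is 2014 and group 2 is 2016
display "mean 2014 = " r(mu_1) "   mean 2016 = " r(mu_2) "   two-sided p-value = " r(p)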

* now let's run another ttest: is average pension income higher than labor income?
ttest yl_real==ytp_real
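* ttest var1==var2 is a paired test: it compares the two incomes observation by observation.
* A sketch of the equivalent one-sample test on the difference (diff_lab_pens is a name made up for this example);
* the t statistic should coincide with the one above
gen diff_lab_pens=yl_real-ytp_real
ttest diff_lab_pens==0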

*****************
* 5.	HYPOTHESIS TESTING FOR CATEGORICAL VARIABLES

* is the employment rate in 2016 higher than in 2014?
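* (the "if !missing()" guard matters because Stata treats a missing value as larger than any number)
gen employed=yl_real>0 if !missing(yl_real)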
ta employed
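* In Stata a missing numeric value counts as larger than any number, so the condition yl_real>0 is also true when yl_real is missing.
* A sketch of a check: how many observations are classified as employed although their labor income is missing?
* (0 means the dummy is not affected by missing values)
count if employed==1 & missing(yl_real)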
* employed is a binary variable (dummy). We should perform a different type of test than the ttest
* Pearson chi2 test (under H0 anno and employed are independent)
ta anno employed, row chi2
* prtest is designed for binary variables, and its interpretation is similar to that of ttest
prtest employed, by(anno)
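* For a 2x2 table the two tests are closely related: the squared z statistic of the two-sample test of proportions
* equals the Pearson chi2 statistic. A sketch of the comparison (z2_employed is a name made up for this example)
quietly prtest employed, by(anno)
scalar z2_employed=r(z)^2
quietly ta anno employed, chi2
display "z^2 from prtest = " z2_employed "   Pearson chi2 = " r(chi2)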

* is the proportion of retirees higher in 2014 or 2016?
gen retired=ytp_real>0 if !missing(ytp_real)
prtest retired, by(anno)

*****************
* 6.	LONGITUDINAL ANALYSIS: INDIVIDUAL INCOME GROWTH

* in this dataset we can potentially observe the same individual both in 2014 and 2016

* first let's create a unique individual identifier
* an individual is identified by the combination of the household id and the member's position within the household
* this command assigns a unique value to the variable id_indivdiual for each combination of the variables id_household and member_household
egen id_indivdiual=group(id_household member_household)
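* A sketch of a check: each individual should appear at most once per year, otherwise the panel commands used later cannot work.
* The report below should show no surplus observations for the pair (individual, year)
duplicates report id_indivdiual anno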

* how many individuals are observed more than once (in 2014 and 2016)?
duplicates report id_indivdiual
* another way to check this
* _N is the total number of observations. If I compute this by individual, I have the number of times the same individual is observed in the data
bys id_indivdiual: ge N=_N
ta N
* there are 11,742/2=5,871 individuals observed twice
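* ta N counts observations, not individuals. A sketch of how to count individuals directly:
* tag() marks exactly one observation per individual (first_obs is a name made up for this example)
egen byte first_obs=tag(id_indivdiual)
ta N if first_obs==1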

* compute the growth in labor income for the same individual across time
* only for individuals that work both in 2016 and 2014
* since the command is long, I split it over two lines using the "///" symbol (to execute the command, I have to select both lines)

bys id_indivdiual (anno): gen yl_growth=yl_real[_N]-yl_real[1] ///
if _n==2&employed[2]==1&employed[1]==1

* NOTICE
* bys id_indivdiual (anno) means that the command is executed separately for each individual, after the data has been sorted by year within each individual
/*
in practice, the data will be sorted as follows
	id_ind	anno
	1		2014
	1		2016
	2		2014
	2		2016
	3		2014
	3		2016
	...
and the command "gen" + "if" will be executed separately for each individual

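the result would not change if I wrote this command instead (why?)
note the extra condition _N==2: without it, an employed individual observed only once would get a growth of 0 instead of a missing value

bys id_indivdiual (anno): gen yl_growth=yl_real[_N]-yl_real[1] ///
if _n==_N&_N==2&employed[_N]==1&employed[1]==1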

*/
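
* A minimal sketch on made-up data (three invented rows, not taken from rper.dta) showing how _n, _N and [ ] behave within bys.
* preserve/restore put the real dataset back in memory afterwards; id, y, emp, g_n2 and g_nN are names made up for this example
preserve
clear
input id anno y emp
1 2014 100 1
1 2016 120 1
2 2016  50 1
end
* condition on _n==2: only the second observation of an individual can receive a value
bys id (anno): gen g_n2=y[_N]-y[1] if _n==2&emp[2]==1&emp[1]==1
* condition on _n==_N only: the last observation receives a value, even when it is also the first one
bys id (anno): gen g_nN=y[_N]-y[1] if _n==_N&emp[_N]==1&emp[1]==1
* individual 1 gets 20 in both variables; individual 2, observed once, gets missing in g_n2 but 0 in g_nN
list, sepby(id)
restore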


su yl_growth
* was the individual growth statistically different from zero on average?
ttest yl_growth==0

* we can perform the same computations using the Stata command xtset
xtset id_indivdiual anno

* "L2." means the two-period lag (Stata knows from xtset that the time variable is anno, and 2016 and 2014 are two years apart)
gen yl_growth_bis=yl_real-L2.yl_real if employed==1&L2.employed==1
* check that the two variables are identical!
su yl_growth yl_growth_bis
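* su only compares summary statistics. A sketch of an observation-by-observation check:
* compare reports how many observations are equal, different, or missing in only one of the two variables,
* and the count should be 0 if the two variables are identical
compare yl_growth yl_growth_bis
count if yl_growth!=yl_growth_bis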

*****************
* 7.	LONGITUDINAL ANALYSIS: THE REPLACEMENT RATE

* some workers in 2014 become retirees in 2016
* what is the replacement rate?
count if L2.employed==1&retired==1

gen replacement_rate=ytp_real/L2.yl_real if L2.employed==1&retired==1
sum replacement_rate
* if I run summarize with the option ", de" (short for detail) I get more info about the variable, such as percentiles
sum replacement_rate, de

* notice that quite a lot of individuals have a replacement rate above 1
* why?
* - pensions usually depend on the entire career of workers, not just on their very last job spell
* - measurement error/low representativeness of the sample...
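
* A sketch to quantify "quite a lot": the share of new retirees with a replacement rate above 1
* (rr_above1 is a name made up for this example; the guard keeps missing rates out of the dummy)
gen byte rr_above1=replacement_rate>1 if !missing(replacement_rate)
ta rr_above1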

* let's check if labor income in the last job is positively correlated with pension income...
gen last_job=L2.yl_real if L2.employed==1&retired==1

corr ytp_real last_job

* we can check this with a regression analysis as well
reg ytp_real last_job
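* With a single regressor the two analyses are linked: the R-squared of the regression equals the squared correlation.
* A sketch of the comparison (rho_pens_job is a name made up for this example; the correlation is saved first
* because the r() results are overwritten by the next command)
quietly corr ytp_real last_job
scalar rho_pens_job=r(rho)
quietly reg ytp_real last_job
display "rho^2 = " rho_pens_job^2 "   R-squared = " e(r2)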

* and also with a scatterplot
twoway (scatter ytp_real last_job) (lfit ytp_real last_job)


*****************
* 8. FINALLY, LET'S CLOSE THE LOG FILE
log close