*****************
*** LESSON 3 ***
*****************

* 1.	Preliminary Operations
cd "C:\Desktop\lezione_3"
log using "lecture3.log", replace

* 2.	EXPLORATORY ANALYSIS OF THE panel_rl.dta DATABASE AND POLYNOMIAL SPECIFICATION
use panel_rl.dta, clear

describe
summarize

* I generate the variable age and its powers
ge eta=anno-anno_nascita

ge eta2=eta^2

ge eta3=eta^3


* multivariate regression:
* I can insert "fixed effects" with the prefix i.
* the prefix i.variable inserts a dummy variable into the regression for each value of "variable"

reg retrib03 eta n_dipendenti tempo_d occ_manuale uomo i.settore i.anno

* specify income as a cubic function of age

reg retrib03 eta eta2 eta3 n_dipendenti tempo_d occ_manuale uomo i.settore i.anno

* test the null hypothesis of linearity against the alternative hypothesis that the population regression is quadratic or cubic

test eta2 eta3

* Effect of going from 39 to 40: first we predict fitted values

predict y_hat

* the estimated effect is reported in the row "diff" of the output below (mean of y_hat at 39 minus mean of y_hat at 40).
* warning: this procedure is not correct for making statistical inference!
* ttest treats y_hat as observed data: its standard error does not take into account that y_hat is computed from estimated regression coefficients!

ttest y_hat if inrange(eta, 39,40), by(eta)


* to make statistical inference on marginal effects we replicate the regression from before.
* we use c.eta##c.eta##c.eta to specify a triple interaction of age (equivalent to entering eta eta2 and eta3, but this time we do not have to create the variables eta2 and eta3)

reg retrib03 c.eta##c.eta##c.eta n_dipendenti tempo_d occ_manuale uomo i.settore i.anno

* the command below calculates the average predicted value at each age with the correct standard errors

margins, over(eta) post

* now we calculate the difference between the predicted values at 40 and 39 years old

di _b[40.eta]-_b[39.eta]

* we test whether this difference is equal to 0

test _b[40.eta]==_b[39.eta]

* the conclusions of the test are identical to those obtained with the wrong procedure ("ttest y_hat").
* however, if we tested a more uncertain hypothesis (e.g., that the predicted value of y is 41400 at 40 years) we might get different conclusions!

test _b[40.eta]==41400
ttest y_hat==41400 if eta==40
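
* A minimal sketch (not part of the original lesson) of another way to test the effect of going from 39 to 40.
* Assumption: all other covariates are held fixed, so the effect is a linear combination of the three age coefficients:
* b1*(40-39) + b2*(40^2-39^2) + b3*(40^3-39^3) = b1 + 79*b2 + 4681*b3
* "margins, post" has replaced the stored estimates, so we first re-run the regression quietly.
* Unlike "margins, over(eta)", this does not compare groups of workers whose other characteristics may differ.

quietly reg retrib03 c.eta##c.eta##c.eta n_dipendenti tempo_d occ_manuale uomo i.settore i.anno
lincom eta + 79*c.eta#c.eta + 4681*c.eta#c.eta#c.eta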


* 2.	LOGARITHMIC SPECIFICATIONS

*******************
*******************
* model with firm size as a linear function: 
* _b gives us the increase of Y for a unit increase of X (increase of one employee)

reg retrib03 n_dipendenti c.eta##c.eta##c.eta tempo_d occ_manuale uomo i.settore i.anno

* we draw the relationship between income and firm size (we ignore the constant and other variables)
ge lineare=_b[n_dipendenti]*n_dipendenti
sort n_dipendenti
twoway (line lineare n_dipendenti), ytitle("Predicted income")


*******************
*******************
* Lin-log model: _b/100 = increase in Y for a 1% increase in X

* we try to transform the variable number of employees into log

gen ln_dipendenti=ln(n_dipendenti)

reg retrib03 ln_dipendenti c.eta##c.eta##c.eta tempo_d occ_manuale uomo i.settore i.anno

* we draw the relationship between income and ln firm size (we ignore the constant and other variables)

ge lin_log=_b[ln_dipendenti]*ln_dipendenti
sort n_dipendenti
twoway (line lin_log n_dipendenti) (line lineare n_dipendenti), ytitle("Predicted income")
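
* A worked example (a sketch added for illustration) of the lin-log interpretation:
* change in Y associated with a 1% increase in the number of employees, using the lin-log estimates above.
* The rule _b/100 is an approximation; the exact change is _b*ln(1.01).
di "approximation _b/100:  " _b[ln_dipendenti]/100
di "exact _b*ln(1.01):     " _b[ln_dipendenti]*ln(1.01)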


*******************
*******************
* Log-lin model: _b*100 = % increase in Y for a unit increase in X

* let's try to transform the income variable into log

gen ln_income=ln(retrib03)

reg ln_income n_dipendenti c.eta##c.eta##c.eta tempo_d occ_manuale uomo i.settore i.anno

* we draw the relationship (this time we insert the constant to visualize better)

ge log_lin=exp(_b[_cons]+_b[n_dipendenti]*n_dipendenti)

sort n_dipendenti
twoway (line log_lin n_dipendenti) (line lineare n_dipendenti), ytitle("Predicted income")
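
* A worked example (a sketch added for illustration) of the log-lin interpretation:
* % change in Y associated with one additional employee, using the log-lin estimates above.
* The rule _b*100 is an approximation that works well for small coefficients; the exact change is 100*(exp(_b)-1).
di "approximation 100*_b:      " 100*_b[n_dipendenti]
di "exact 100*(exp(_b)-1):     " 100*(exp(_b[n_dipendenti])-1)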


*******************
*******************
* Log-log model: _b = % increase in Y for a 1% increase in X


reg ln_income ln_dipendenti c.eta##c.eta##c.eta tempo_d occ_manuale uomo i.settore i.anno

* we draw the relationship (again we insert the constant to visualize better)

ge log_log=exp(_b[_cons]+_b[ln_dipendenti]*ln_dipendenti)

sort n_dipendenti
twoway (line log_log n_dipendenti) (line lineare n_dipendenti), ytitle("Predicted income")
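
* A worked example (a sketch added for illustration) of the log-log interpretation:
* % change in Y associated with a 1% increase in the number of employees, using the log-log estimates above.
* The coefficient itself is the approximate elasticity; the exact change is 100*(1.01^_b-1).
di "approximation _b:          " _b[ln_dipendenti]
di "exact 100*(1.01^_b-1):     " 100*(1.01^_b[ln_dipendenti]-1)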


* 	3.	INTERACTIONS

*******************
*******************
* interaction between two dummy variables
gen int_uomo_manuale=uomo * occ_manuale

* the coefficient associated with int_uomo_manuale tells us what the additional effect of being in a manual occupation is for men compared to women

reg ln_income occ_manuale uomo int_uomo_manuale n_dipendenti c.eta##c.eta##c.eta tempo_d i.settore i.anno

* you can also achieve the same result with the following procedure, without having to create a new interaction variable

reg ln_income i.occ_manuale##i.uomo n_dipendenti c.eta##c.eta##c.eta tempo_d i.settore i.anno
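
* A minimal sketch (added for illustration), assuming occ_manuale and uomo are both coded 0/1:
* the effect of a manual occupation for women is the coefficient on 1.occ_manuale;
* for men it is the sum of that coefficient and the interaction coefficient, which lincom reports with its standard error
lincom 1.occ_manuale + 1.occ_manuale#1.uomo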

*******************
*******************
* Interaction between dummy variable and continuous variable
gen int_uomo_dipendenti=uomo * ln_dipendenti

* the coefficient associated with int_uomo_dipendenti tells us what the additional effect of a unit increase in ln_dipendenti is for men compared with women

reg ln_income ln_dipendenti uomo int_uomo_dipendenti c.eta##c.eta##c.eta tempo_d occ_manuale i.settore i.anno

* again, the same result can be obtained with the following procedure

reg ln_income c.ln_dipendenti##i.uomo c.eta##c.eta##c.eta tempo_d occ_manuale i.settore i.anno
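
* A minimal sketch (added for illustration): after the factor-variable specification above,
* margins can report the slope of ln_dipendenti separately for women and men, each with its own standard error.
* The difference between the two slopes equals the interaction coefficient.
margins uomo, dydx(ln_dipendenti)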


* 4.	FIXED-EFFECTS REGRESSION

* This database has a panel structure (the same workers are observed several times over time)

xtset id_soggetto anno


* In the "fixed effects" regression we basically enter a dummy variable for each individual (we do not show the coefficients, that would be too many!)
* this regression is used to condition income on unobservable characteristics of the individual, as long as they are constant characteristics over time
* adding fixed effects could make our identifying assumptions more credible
/*
 given a model
 Y = X b + E
we need Cov(X, E)=0 to causally interpret b
Restricting E by adding more information in X is a way
of making this assumption more credible
*/

* there are two equivalent ways of estimating fixed-effects regression:

* areg: includes dummies for each individual (without showing the coefficients associated with these dummies)

* xtreg: transforms the regression model without fixed effects into a model in which Y and X are expressed
* as deviation from the individual mean across all periods.
* This command exploits an equivalence between the models described above that can be derived analytically

areg ln_income ln_dipendenti c.eta##c.eta##c.eta tempo_d occ_manuale uomo i.settore i.anno, absorb(id_soggetto)

xtreg ln_income ln_dipendenti c.eta##c.eta##c.eta tempo_d occ_manuale uomo i.settore i.anno, fe

* variables that do not change for an individual over time (e.g., the variable "uomo") cannot be included in the regression because they are collinear with individual fixed effects

* the 2001 year dummy is also dropped, because it is linearly dependent on age and the individual fixed effects.
* the reason here is a bit more subtle:
/* 
including individual fixed effects is like estimating a model of this type

Y - EY = B_0 + B_1 (X - EX) + error

where EY and EX are the average of each individual's Y and X across all years.
In the case of X = age, assuming everyone is in the sample for 3 years we will have that for all individuals

age - Eage = -1 in the first year
age - Eage = 0 in the second year
age - Eage = 1 in the third year

In the second year i.2000 captures the conditional difference in Y between 1999 and 2000 (when age changes by 1 unit)
In the third year i.2001 captures the conditional difference in Y between 1999 and 2001 (when age changes by 2 units)
The conditional difference between 2001 and 2000 is always mechanically equal to i.2001-i.2000

As can be seen, there are not enough comparisons to estimate both b_eta and b_i.2000 and b_i.2001

Omitting b_i.2000 we will have that
b_eta: conditional difference between 1999 and 2000
b_i.2001: conditional difference between 2001 and 1999-2000

*/
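
* A minimal sketch (added for illustration) of the equivalence exploited by xtreg, fe, with a single regressor to keep it short.
* The names ok, m_y, m_x, dm_y and dm_x are hypothetical helper variables, dropped at the end.
gen byte ok = !missing(ln_income, ln_dipendenti)
bys id_soggetto: egen m_y = mean(ln_income) if ok
bys id_soggetto: egen m_x = mean(ln_dipendenti) if ok
gen dm_y = ln_income - m_y
gen dm_x = ln_dipendenti - m_x

* the two slope coefficients should coincide (the standard errors differ because reg does not adjust the degrees of freedom for the individual means)
reg dm_y dm_x
xtreg ln_income ln_dipendenti if ok, fe
drop ok m_y m_x dm_y dm_x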


* 5.	AKM REGRESSION MODEL AND ITS VARIANCE DECOMPOSITION

* this is a time-consuming estimation method. For this reason, we reduce the sample size by keeping a random 10% of workers
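set seed 12345 // arbitrary value (an assumption, any fixed seed works): it makes the random subsample reproducible
gen step=runiform()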
bys id_soggetto: keep if step[1]<.1
drop step

/*
Card et al (2014) analyse the evolution of German wage inequality using an AKM regression model.

This is a fixed effects regression with individual and firm fixed effects, originally proposed by Abowd, Kramarz and Margolis (1999) - hence the name AKM

A firm fixed effect is a potentially causal estimate of the wage return to working at a given firm, conditional on the quality composition of the firm's workforce

*/

* reghdfe is a user-written command to estimate high-dimensional fixed effects models
* it has to be installed the first time you use it with the following command (you need an internet connection!)

*ssc install reghdfe, replace

* let's estimate the AKM regression model, saving the estimated firm and individual fixed effects into two new variables called fe_ind and fe_firm
reghdfe ln_income ln_dipendenti c.eta##c.eta##c.eta tempo_d occ_manuale uomo i.settore i.anno, absorb(fe_ind=id_soggetto fe_firm=id_azienda) resid

* reghdfe does not include singletons in the estimation sample (observations that are the only one in their fixed-effect group, e.g. a worker observed only once, and are therefore perfectly fitted by the fixed effects)
* the reason is that they inflate the number of observations but do not contribute to the estimation of the parameters of interest in the regression, so including them could lead to an under-estimation of the standard errors

* we keep in the analysis only observations that were included in the estimation sample by the command reghdfe
keep if e(sample)

* reghdfe has created new variables equal to the estimated firm and worker fixed effects and the estimated residual
de 
rename _reghdfe_resid residuals // _reghdfe_resid is the default name of the variable saved by the resid option

* we predict the wage using the estimated coefficients associated with the time-varying controls (the fixed effects are excluded)
predict xb, xb

* double check that the sum of all wage components is equal to actual wages
gen y=xb+fe_ind+fe_firm+residuals

su y ln_income
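
* A small check (a sketch added for illustration; "gap" is a hypothetical helper variable):
* the relative difference between the reconstructed and the actual log wage should be zero up to rounding error
gen double gap = reldif(y, ln_income)
su gap
drop gap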

* we can estimate the AKM variance decomposition
/*
given the following regression model

Y = XB + FE_ind + FE_firm + RES

assuming E(RES|XB,FE_ind,FE_firm) = E(RES)

we have that

Var(Y) = Var(XB) + Var(FE_ind) + Var(FE_firm) + 2*Cov(XB,FE_ind) + 2*Cov(XB,FE_firm) + 2*Cov(FE_ind,FE_firm) + Var(RES)
*/

corr xb fe_firm, cov

gen var_xb=`r(Var_1)'
gen var_firm=`r(Var_2)'
gen cov_x_firm=2*`r(cov_12)'

corr fe_ind fe_firm, cov

gen var_ind=`r(Var_1)'
gen cov_ind_firm=2*`r(cov_12)'

corr fe_ind xb, cov
gen cov_x_ind=2*`r(cov_12)'

corr residuals, cov

gen var_res=`r(Var_1)'

gen total_var=var_xb + var_firm + var_ind + cov_x_firm +cov_x_ind + cov_ind_firm + var_res

* AKM DECOMPOSITION OF THE TOTAL INCOME VARIANCE
su total_var var_xb  var_firm  var_ind  cov_x_firm cov_x_ind  cov_ind_firm var_res

* is the total variance actually equal to the variance of log income? (double check)
corr ln_income, cov
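
* A minimal sketch (added for illustration): share of Var(ln_income) accounted for by each component of the decomposition.
* v_y is a hypothetical scalar name; the component variables are constant across observations, so we read their first observation.
quietly su ln_income
scalar v_y = r(Var)
foreach v in var_xb var_firm var_ind cov_x_firm cov_x_ind cov_ind_firm var_res {
    di "`v' share of total variance: " %6.3f `v'[1]/scalar(v_y)
}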


log close