------------------------------------------------------------------------------
      name:  <unnamed>
       log:  /Users/bernardofanfani/Desktop/teaching/research_topics_labor/lab_2/lecture2.log
  log type:  text
 opened on:  16 Oct 2024, 10:07:10

.
.
. * 2. HYPOTHESIS TESTING AND UNIVARIATE REGRESSION
.
. * import an Excel file into Stata
. import excel "dati_esercizio.xls", sheet("Sheet1") firstrow clear
(4 vars, 12,281 obs)

.
. describe

Contains data
 Observations:        12,281
    Variables:             4
------------------------------------------------------------------------------
Variable      Storage   Display    Value
    name         type    format    label      Variable label
------------------------------------------------------------------------------
edu_father      byte    %10.0g                edu_father
donna           byte    %10.0g                donna
salario_mensile double  %10.0g                salario_mensile
age             byte    %10.0g                age
------------------------------------------------------------------------------
Sorted by:
     Note: Dataset has changed since last saved.

. label var edu_father "=1 se padre con diploma o laurea"

. label var donna "=1 se donna"

.
. * - Do women earn less than men?
.
. ttest salario_mensile, by(donna)

Two-sample t test with equal variances
------------------------------------------------------------------------------
   Group |     Obs        Mean    Std. err.   Std. dev.   [95% conf. interval]
---------+--------------------------------------------------------------------
       0 |   6,014    1421.057    8.499393    659.1278    1404.395    1437.719
       1 |   6,267    1116.135    6.106033    483.3804    1104.165    1128.105
---------+--------------------------------------------------------------------
Combined |  12,281    1265.455    5.377937    595.9813    1254.914    1275.997
---------+--------------------------------------------------------------------
    diff |            304.9214    10.40074                284.5343    325.3085
------------------------------------------------------------------------------
    diff = mean(0) - mean(1)                                      t =  29.3173
H0: diff = 0                                     Degrees of freedom =    12279

    Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
 Pr(T < t) = 1.0000         Pr(|T| > |t|) = 0.0000          Pr(T > t) = 0.0000

.
. * - Do the children of more educated fathers earn more?
. ttest salario_mensile, by(edu_father)

Two-sample t test with equal variances
------------------------------------------------------------------------------
   Group |     Obs        Mean    Std. err.   Std. dev.   [95% conf. interval]
---------+--------------------------------------------------------------------
       0 |   8,732    1248.228    6.103583    570.3503    1236.263    1260.192
       1 |   3,549    1307.841    10.96024    652.9399    1286.352     1329.33
---------+--------------------------------------------------------------------
Combined |  12,281    1265.455    5.377937    595.9813    1254.914    1275.997
---------+--------------------------------------------------------------------
    diff |            -59.6137    11.85251               -82.84648   -36.38091
------------------------------------------------------------------------------
    diff = mean(0) - mean(1)                                      t =  -5.0296
H0: diff = 0                                     Degrees of freedom =    12279

    Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
 Pr(T < t) = 0.0000         Pr(|T| > |t|) = 0.0000          Pr(T > t) = 1.0000

.
. /*
> - Do women with highly educated fathers earn more than men with poorly educated fathers?
>   (It is recommended to preliminarily create a variable =1 for women with educated fathers and for men
>   with poorly educated fathers, =0 in other cases.
>   Then use "ttest" with an "if" condition and the "by" option)
> */
.
. gen selezione=0

. replace selezione=1 if donna==1 & edu_father==1
(1,941 real changes made)

. replace selezione=1 if donna==0 & edu_father==0
(4,406 real changes made)

.
. ttest salario_mensile if selezione==1, by(donna)

Two-sample t test with equal variances
------------------------------------------------------------------------------
   Group |     Obs        Mean    Std. err.   Std. dev.   [95% conf. interval]
---------+--------------------------------------------------------------------
       0 |   4,406    1410.162    9.319065    618.5782    1391.892    1428.432
       1 |   1,941    1189.318    11.83667    521.4854    1166.105    1212.532
---------+--------------------------------------------------------------------
Combined |   6,347    1342.625    7.521778    599.2454     1327.88     1357.37
---------+--------------------------------------------------------------------
    diff |            220.8432    16.08919                 189.303    252.3835
------------------------------------------------------------------------------
    diff = mean(0) - mean(1)                                      t =  13.7262
H0: diff = 0                                     Degrees of freedom =     6345

    Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
 Pr(T < t) = 1.0000         Pr(|T| > |t|) = 0.0000          Pr(T > t) = 0.0000

.
.
. /*
> Estimate a regression in which the dependent variable is monthly salary and the
> independent is age. Does getting one year older affect wages? What is the size of this effect?
> */
.
. reg salario_mensile age

      Source |       SS           df       MS      Number of obs   =    12,281
-------------+----------------------------------   F(1, 12279)     =     70.01
       Model |  24729688.6         1  24729688.6   Prob > F        =    0.0000
    Residual |  4.3370e+09    12,279  353208.607   R-squared       =    0.0057
-------------+----------------------------------   Adj R-squared   =    0.0056
       Total |  4.3618e+09    12,280  355193.662   Root MSE        =    594.31

------------------------------------------------------------------------------
salario_me~e | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
         age |    3.73644   .4465441     8.37   0.000     2.861144    4.611737
       _cons |   1117.741   18.45003    60.58   0.000     1081.576    1153.906
------------------------------------------------------------------------------

.
.
.
. * 3. EXPLORATORY ANALYSIS OF RAPPORTI DI LAVORO DATABASE AND APPEND
.
. use rapporti_lavoro_2000.dta, clear

. des

Contains data from rapporti_lavoro_2000.dta
 Observations:       604,815
    Variables:             6                  7 Sep 2022 12:02
------------------------------------------------------------------------------
Variable      Storage   Display    Value
    name         type    format    label      Variable label
------------------------------------------------------------------------------
id_soggetto     float   %9.0g                 Codice identificativo lavoratore
id_azienda      float   %9.0g                 Codice identificativo azienda
anno            int     %9.0g
retrib03        float   %9.0g                 retribuzione annuale riportata ad euro del 2003
tempo_d         byte    %9.0g                 contratto a tempo determinato
occ_manuale     byte    %8.0g                 occupazione manuale
------------------------------------------------------------------------------
Sorted by:

. tab anno

       anno |      Freq.     Percent        Cum.
------------+-----------------------------------
       2000 |    604,815      100.00      100.00
------------+-----------------------------------
      Total |    604,815      100.00

.
. use rapporti_lavoro_2001.dta, clear

.
. des

Contains data from rapporti_lavoro_2001.dta
 Observations:       610,951
    Variables:             6                  7 Sep 2022 12:02
------------------------------------------------------------------------------
Variable      Storage   Display    Value
    name         type    format    label      Variable label
------------------------------------------------------------------------------
id_soggetto     float   %9.0g                 Codice identificativo lavoratore
id_azienda      float   %9.0g                 Codice identificativo azienda
anno            int     %9.0g
retrib03        float   %9.0g                 retribuzione annuale riportata ad euro del 2003
tempo_d         byte    %9.0g                 contratto a tempo determinato
occ_manuale     byte    %8.0g                 occupazione manuale
------------------------------------------------------------------------------
Sorted by:

. tab anno

       anno |      Freq.     Percent        Cum.
------------+-----------------------------------
       2001 |    610,951      100.00      100.00
------------+-----------------------------------
      Total |    610,951      100.00

.
. append using rapporti_lavoro_2000.dta

. append using rapporti_lavoro_1999.dta

.
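. /*
> A possible shortcut, sketched here and not run in this session: append accepts
> several files in one call, and its generate() option records which file each
> row came from. The variable name "origine" is hypothetical.
>
>     use rapporti_lavoro_2001.dta, clear
>     append using rapporti_lavoro_2000.dta rapporti_lavoro_1999.dta, generate(origine)
>     tab anno origine
>
> With generate(), origine is 0 for rows of the file in memory and 1, 2 for rows
> of the first and second appended file, so the cross-tabulation with anno is a
> quick check that each year came from the expected file.
> */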
. des

Contains data from rapporti_lavoro_2001.dta
 Observations:     1,806,788
    Variables:             6                  7 Sep 2022 12:02
------------------------------------------------------------------------------
Variable      Storage   Display    Value
    name         type    format    label      Variable label
------------------------------------------------------------------------------
id_soggetto     float   %9.0g                 Codice identificativo lavoratore
id_azienda      float   %9.0g                 Codice identificativo azienda
anno            int     %9.0g
retrib03        float   %9.0g                 retribuzione annuale riportata ad euro del 2003
tempo_d         byte    %9.0g                 contratto a tempo determinato
occ_manuale     byte    %8.0g                 occupazione manuale
------------------------------------------------------------------------------
Sorted by:
     Note: Dataset has changed since last saved.

. tab anno

       anno |      Freq.     Percent        Cum.
------------+-----------------------------------
       1999 |    591,022       32.71       32.71
       2000 |    604,815       33.47       66.19
       2001 |    610,951       33.81      100.00
------------+-----------------------------------
      Total |  1,806,788      100.00

. save panel_rl.dta, replace
file panel_rl.dta saved

.
. drop if anno==1999
(591,022 observations deleted)

. tab anno

       anno |      Freq.     Percent        Cum.
------------+-----------------------------------
       2000 |    604,815       49.75       49.75
       2001 |    610,951       50.25      100.00
------------+-----------------------------------
      Total |  1,215,766      100.00

.
. tab anno, sum(tempo_d)

            |    Summary of contratto a tempo
            |            determinato
       anno |        Mean   Std. dev.       Freq.
------------+------------------------------------
       2000 |   .09632202   .29503259     604,815
       2001 |   .09111205   .28776863     610,951
------------+------------------------------------
      Total |   .09370389   .29141643   1,215,766

.
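. /*
> Sketch, not run in this session: because tempo_d is a 0/1 variable, its mean by
> year is the share of fixed-term contracts. The two one-way tables that follow
> can also be read off a single two-way table with row percentages:
>
>     tabulate anno tempo_d, row
> */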
. tab tempo_d if anno==2000

contratto a |
      tempo |
determinato |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |    546,558       90.37       90.37
          1 |     58,257        9.63      100.00
------------+-----------------------------------
      Total |    604,815      100.00

. tab tempo_d if anno==2001

contratto a |
      tempo |
determinato |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |    555,286       90.89       90.89
          1 |     55,665        9.11      100.00
------------+-----------------------------------
      Total |    610,951      100.00

.
. prtest tempo_d, by(anno)

Two-sample test of proportions                  2000: Number of obs =   604815
                                                2001: Number of obs =   610951
------------------------------------------------------------------------------
       Group |       Mean   Std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
        2000 |    .096322   .0003794                      .0955785    .0970656
        2001 |   .0911121   .0003682                      .0903905    .0918336
-------------+----------------------------------------------------------------
        diff |     .00521   .0005286                      .0041738    .0062461
             |  under H0:   .0005286     9.86   0.000
------------------------------------------------------------------------------
        diff = prop(2000) - prop(2001)                            z =   9.8562
    H0: diff = 0

    Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
 Pr(Z < z) = 1.0000         Pr(|Z| > |z|) = 0.0000          Pr(Z > z) = 0.0000

.
. * 4. ADD VARIABLES WITH MERGE COMMAND AND REGRESSION ANALYSIS
. use anagrafica_soggetti.dta, clear

.
. des

Contains data from anagrafica_soggetti.dta
 Observations:       754,942
    Variables:             3                  7 Sep 2022 12:01
------------------------------------------------------------------------------
Variable      Storage   Display    Value
    name         type    format    label      Variable label
------------------------------------------------------------------------------
id_soggetto     float   %9.0g                 Codice identificativo lavoratore
uomo            byte    %8.0g
anno_nascita    int     %9.0g
------------------------------------------------------------------------------
Sorted by: id_soggetto

.
. use rapporti_lavoro_2001.dta, clear

. des

Contains data from rapporti_lavoro_2001.dta
 Observations:       610,951
    Variables:             6                  7 Sep 2022 12:02
------------------------------------------------------------------------------
Variable      Storage   Display    Value
    name         type    format    label      Variable label
------------------------------------------------------------------------------
id_soggetto     float   %9.0g                 Codice identificativo lavoratore
id_azienda      float   %9.0g                 Codice identificativo azienda
anno            int     %9.0g
retrib03        float   %9.0g                 retribuzione annuale riportata ad euro del 2003
tempo_d         byte    %9.0g                 contratto a tempo determinato
occ_manuale     byte    %8.0g                 occupazione manuale
------------------------------------------------------------------------------
Sorted by:

.
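. /*
> Sketch, not run in this session: a 1:1 merge requires id_soggetto to identify
> observations uniquely in both files. isid checks this for the data in memory
> and stops with an error if the variable is not a unique identifier. The
> keep(match) and nogenerate options of merge keep only matched rows and skip
> the creation of _merge, which would replace the "keep if _m==3" and "drop _m"
> steps used below.
>
>     isid id_soggetto
>     merge 1:1 id_soggetto using anagrafica_soggetti.dta, keep(match) nogenerate
>
> Keeping _merge and tabulating it, as done in this session, is still the safer
> choice when the share of unmatched rows is itself of interest.
> */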
. merge 1:1 id_soggetto using anagrafica_soggetti.dta

    Result                      Number of obs
    -----------------------------------------
    Not matched                       143,991
        from master                         0  (_merge==1)
        from using                    143,991  (_merge==2)

    Matched                           610,951  (_merge==3)
    -----------------------------------------

. des

Contains data from rapporti_lavoro_2001.dta
 Observations:       754,942
    Variables:             9                  7 Sep 2022 12:02
------------------------------------------------------------------------------
Variable      Storage   Display    Value
    name         type    format    label      Variable label
------------------------------------------------------------------------------
id_soggetto     float   %9.0g                 Codice identificativo lavoratore
id_azienda      float   %9.0g                 Codice identificativo azienda
anno            int     %9.0g
retrib03        float   %9.0g                 retribuzione annuale riportata ad euro del 2003
tempo_d         byte    %9.0g                 contratto a tempo determinato
occ_manuale     byte    %8.0g                 occupazione manuale
uomo            byte    %8.0g
anno_nascita    int     %9.0g
_merge          byte    %23.0g     _merge     Matching result from merge
------------------------------------------------------------------------------
Sorted by:
     Note: Dataset has changed since last saved.

. tab _m

   Matching result from |
                  merge |      Freq.     Percent        Cum.
------------------------+-----------------------------------
         Using only (2) |    143,991       19.07       19.07
            Matched (3) |    610,951       80.93      100.00
------------------------+-----------------------------------
                  Total |    754,942      100.00

. keep if _m==3
(143,991 observations deleted)

. drop _m

.
. * create the variable age in years
. gen age = anno - anno_nascita

.
. su retrib03 age

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
    retrib03 |    610,951    37372.78     24332.5        479     399924
         age |    610,951    36.08291    9.439077         19         60

.
. * linear regression
. reg retrib03 age

      Source |       SS           df       MS      Number of obs   =   610,951
-------------+----------------------------------   F(1, 610949)    =  53104.31
       Model |  2.8927e+13         1  2.8927e+13   Prob > F        =    0.0000
    Residual |  3.3280e+14   610,949   544723543   R-squared       =    0.0800
-------------+----------------------------------   Adj R-squared   =    0.0800
       Total |  3.6173e+14   610,950   592070499   Root MSE        =     23339

------------------------------------------------------------------------------
    retrib03 | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
         age |   728.9883   3.163412   230.44   0.000     722.7882    735.1885
       _cons |   11068.76    117.986    93.81   0.000     10837.51    11300.01
------------------------------------------------------------------------------

.
. predict y_hat
(option xb assumed; fitted values)

.
. /*
> this is a trick to make the graph file smaller:
> scatter prints a point for each nonmissing observation;
> since we have roughly 600,000 observations, this is very heavy and slow.
> y_hat has the same value for each age level, thus
> we need only one point per age value.
> SOLUTION:
> - create a variable (we call it "step") =1 in the first row of each age value, missing otherwise
> - do the scatterplot only for observations where step=1
> */
. bys age: gen step=1 if _n==1
(610,909 missing values generated)

.
. scatter y_hat age if step==1

.
. gen age2=age^2

.
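. /*
> Sketch, not run in this session: the tag() function of egen is an alternative
> way to flag exactly one observation per age value. Unlike step above, the flag
> is 0 rather than missing on the remaining rows. The variable name
> "one_per_age" is hypothetical.
>
>     egen one_per_age = tag(age)
>     scatter y_hat age if one_per_age==1
> */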
. reg retrib03 age age2

      Source |       SS           df       MS      Number of obs   =   610,951
-------------+----------------------------------   F(2, 610948)    =  28343.32
       Model |  3.0713e+13         2  1.5356e+13   Prob > F        =    0.0000
    Residual |  3.3101e+14   610,948   541801532   R-squared       =    0.0849
-------------+----------------------------------   Adj R-squared   =    0.0849
       Total |  3.6173e+14   610,950   592070499   Root MSE        =     23277

------------------------------------------------------------------------------
    retrib03 | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
         age |   2101.942   24.12201    87.14   0.000     2054.664    2149.221
        age2 |  -18.01879   .3138604   -57.41   0.000    -18.63395   -17.40364
       _cons |  -13405.97   442.2544   -30.31   0.000    -14272.77   -12539.16
------------------------------------------------------------------------------

.
. predict y_hat2
(option xb assumed; fitted values)

.
. * the variable step is useful also for this graph
. twoway (scatter y_hat2 age if step==1) (scatter y_hat age if step==1), ///
> legend(order(1 "Modello quadratico" 2 "Modello lineare"))

.
. * marginal effect of moving from 39 to 40 years old
. ttest y_hat if age==39|age==40, by(age)

Two-sample t test with equal variances
------------------------------------------------------------------------------
   Group |     Obs        Mean    Std. err.   Std. dev.   [95% conf. interval]
---------+--------------------------------------------------------------------
      39 |  17,955    39499.31           0           0    39499.31    39499.31
      40 |  17,175     40228.3           0           0     40228.3     40228.3
---------+--------------------------------------------------------------------
Combined |  35,130    39855.71    1.944243    364.4095     39851.9    39859.52
---------+--------------------------------------------------------------------
    diff |           -728.9883           0               -728.9883   -728.9883
------------------------------------------------------------------------------
    diff = mean(39) - mean(40)                                    t =        .
H0: diff = 0                                     Degrees of freedom =    35128

    Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
 Pr(T < t) =      .         Pr(|T| > |t|) =      .          Pr(T > t) =      .

.
. ttest y_hat2 if age==39|age==40, by(age)

Two-sample t test with equal variances
------------------------------------------------------------------------------
   Group |     Obs        Mean    Std. err.   Std. dev.   [95% conf. interval]
---------+--------------------------------------------------------------------
      39 |  17,955     41163.2           0           0     41163.2     41163.2
      40 |  17,175    41841.66           0           0    41841.66    41841.66
---------+--------------------------------------------------------------------
Combined |  35,130     41494.9    1.809474    339.1497    41491.35    41498.44
---------+--------------------------------------------------------------------
    diff |            -678.457           0                -678.457    -678.457
------------------------------------------------------------------------------
    diff = mean(39) - mean(40)                                    t =        .
H0: diff = 0                                     Degrees of freedom =    35128

    Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
 Pr(T < t) =      .         Pr(|T| > |t|) =      .          Pr(T > t) =      .

.
.
. * marginal effect of moving from 29 to 30 years old
. ttest y_hat if age==29|age==30, by(age)

Two-sample t test with equal variances
------------------------------------------------------------------------------
   Group |     Obs        Mean    Std. err.   Std. dev.   [95% conf. interval]
---------+--------------------------------------------------------------------
      29 |  26,000    32209.42           0           0    32209.42    32209.42
      30 |  25,759    32938.41           0           0    32938.41    32938.41
---------+--------------------------------------------------------------------
Combined |  51,759    32572.22    1.602123    364.4927    32569.08    32575.36
---------+--------------------------------------------------------------------
    diff |           -728.9863           0               -728.9863   -728.9863
------------------------------------------------------------------------------
    diff = mean(29) - mean(30)                                    t =        .
H0: diff = 0                                     Degrees of freedom =    51757

    Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
 Pr(T < t) =      .         Pr(|T| > |t|) =      .          Pr(T > t) =      .

.
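. /*
> Sketch, not run in this session: the differences reported by these t tests
> follow directly from the regression coefficients. In the linear model a
> one-year change always equals the coefficient on age. In the quadratic model
> the change from age a to a+1 is b_age + b_age2*((a+1)^2 - a^2), that is,
> b_age + b_age2*(2a+1). Assuming "reg retrib03 age age2" is still the active
> estimation, lincom returns this quantity together with a standard error:
>
>     lincom age + 79*age2
>     lincom age + 59*age2
>
> The multipliers are 40^2 - 39^2 = 79 and 30^2 - 29^2 = 59. Up to sign and
> rounding, the results should match the diff rows of the t tests on y_hat2.
>
> The linear-model differences for the two age pairs differ in the last digits
> (-728.9883 and -728.9863). A likely cause is that predict stores fitted values
> as float by default; "predict double y_hat" would keep more precision.
> */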
. ttest y_hat2 if age==29|age==30, by(age)

Two-sample t test with equal variances
------------------------------------------------------------------------------
   Group |     Obs        Mean    Std. err.   Std. dev.   [95% conf. interval]
---------+--------------------------------------------------------------------
      29 |  26,000    32396.55           0           0    32396.55    32396.55
      30 |  25,759    33435.39           0           0    33435.39    33435.39
---------+--------------------------------------------------------------------
Combined |  51,759    32913.55    2.283084    519.4154    32909.08    32918.03
---------+--------------------------------------------------------------------
    diff |           -1038.832           0               -1038.832   -1038.832
------------------------------------------------------------------------------
    diff = mean(29) - mean(30)                                    t =        .
H0: diff = 0                                     Degrees of freedom =    51757

    Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
 Pr(T < t) =      .         Pr(|T| > |t|) =      .          Pr(T > t) =      .

.
.
. * we calculate the average of the actual wages and predictions for each age:
. * the command generates a new dataset with one observation per age.
. * the variables in the new dataset contain the averages of retrib03 y_hat y_hat2 by age
. collapse (mean) retrib03 y_hat y_hat2, by(age)

.
. twoway (connected retrib03 age, msize(vsmall)) ///
> (connected y_hat age, msize(vsmall)) ///
> (connected y_hat2 age, msize(vsmall))

.
.
. * we can improve the graph appearance (twoway has a lot of options to play with!)
. twoway (connected retrib03 age, msize(vsmall)) ///
> (connected y_hat age, msize(vsmall)) ///
> (connected y_hat2 age, msize(vsmall)), ///
> legend(order(1 "Actual wages" 2 "Linear prediction" 3 "Quadratic prediction")) ///
> title("Actual average and predicted wage by age") ///
> ytitle("Wage level") xtitle("Age")

.
.
.
. * 5. close the log file
.
. log close
      name:  <unnamed>
       log:  /Users/bernardofanfani/Desktop/teaching/research_topics_labor/lab_2/lecture2.log
  log type:  text
 closed on:  16 Oct 2024, 10:07:16
------------------------------------------------------------------------------