
eof分析
skdj指标-厉行节约
2023年2月18日发(作者:气瓶充装许可证)5.4主成分聚类与主成分回归
5.4.1变量聚类与样品分类
主成分分析可用于聚类:变量聚类与样品聚类。
变量聚类:由主成分系数的差异,可将变量聚类。例如例5.5中第2主成分中murder,rape,
assult系数为负的,burglary,larceny,auto系数是正的。按系数正负可把7个变量分为两
类:murder,rape,assult属于暴力程度严重的一类;burglary,larceny,auto属于暴力程
度较轻的一类。按照这种方法,根据主成分系数的正负可以将变量聚类。
样品聚类:如果2个主成分能很好的概括随机向量的信息,计算每个样品的这两个主成
分得分,把他们的散点图画出来,就能从图上将样品分类。
例5.5(续2)按照第一、第二主成分得分,画出散点图
datacrime;/*建立数据集crime*/
inputstate$1-15murderraperobberyassultburglarylarcenyauto;
/*建立变量statemurderraperobberyassultburglarylarceny
auto。state$1-15表示前15列存州名。murderraperobberyassultburglarylarcenyauto
表7种罪的犯罪率*/
cards;/*以下为数据体*/
Albama14.225.296.8278.31135.51881.9280.7
Alaska10.851.696.8284.01331.73369.8753.3
Arirona9.534.2138.2312.32346.14467.4439.5
Arkansas8.834.2138.2312.32346.14467.4439.5
Califonia11.549.4287.0358.02139.43499.8663.5
Colorado6.342.0170.7292.91935.23903.2477.1
Conecticat4.216.8129.5131.81346.02620.7593.2
Delaware6.024.9157.0194.21682.63678.4467.0
Florida10.239.6187.9449.11859.93840.5351.4
Geogia11.731.1140.5256.51351.12170.2297.9
Hawaii7.225.5128.064.11911.53920.4489.4
Idaho5.519.439.6172.51050.82599.6237.6
Illinois9.921.8211.3209.01085.02828.5528.6
Indiana7.426.5123.2153.51086.22498.7377.4
Iowa2.310.641.289.8812.52685.1219.9
Kansas6.622.0100.7180.51270.42739.3244.3
Kentaky10.119.181.1123.3872.21662.1245.4
Loisana15.530.9142.9335.51165.52469.9337.7
Maine2.413.538.7170.01253.12350.7246.9
Maryland8.034.8292.1358.91400.03177.7428.5
Masschusetts3.120.8169.1231.61532.22311.31140.1
Michigan9.338.9261.9274.61522.73159.0545.5
Minnesota2.719.585.985.81134.72559.3343.1
Mississippi14.319.665.7189.1915.61239.9144.4
Missouri9.628.3189.0233.51318.32424.2378.4
Montana5.416.739.2156.8804.92773.2309.3
Nebraska3.918.164.7112.7760.02316.1249.1
Nevada15.849.1323.1355.02453.14212.6559.2
MewHampashare3.210.723.276.01041.72343.9293.4
NewJersey5.621.0180.4185.11435.82774.5511.5
NewMaxico8.839.1109.6343.41418.73008.6259.5
NewYork10.729.4472.6319.11728.02782.0745.8
NorthCarolina10.617.061.3318.31154.12037.8192.1
NorthDakoda100.99.013.343.8446.11843.0144.7
Ohio7.827.3190.5181.11216.02696.8400.4
Oklahoma8.629.273.8205.01288.22228.1326.8
Oregan4.939.9124.1286.91636.43506.1388.9
Pennsyvania5.619.0130.3128.0877.51624.1333.2
RhodeIsland3.610.586.5201.01849.52844.1791.4
SouthCarolina11.933.0105.9485.31613.62342.4245.1
SouthDakoda2.013.517.9155.7570.51704.4147.5
Tennessee10.129.7145.8203.91259.71776.5314.0
Texas13.333.8152.4208.21603.12988.7397.6
Utah3.520.368.8147.31171.63004.6334.5
Vermont1.415.930.8101.21348.22201.0265.2
Virginia9.023.392.1165.7986.22521.2226.7
Wasinton4.339.6106.2224.81605.63386.9360.3
WestViginia6.013.242.290.9597.41341.7163.3
Wiskonsin2.812.952.263.7846.92614.2220.7
Wyoming5.421.939.7173.9811.62772.2282.0
;
procprincompout=crimprinn=2;
varmurderraperobberyassultburglarylarcenyauto;
run;
PROCPLOTdata=crimprin;
PLOTPRIN2*PRIN1=STATE/VPOS=31;
TITLE2‘PLOTOFTHEFIRSTTWOPRINCIPALCOMPONENTS’;
RUN;
例5.7(气温分析)本例的输入资料文件(TEMPERAT)是美国六十四个城市一月与七
月的平均日温。
DATATEMPERAT;
TITLE2'MEANTEMPERATUREINJANUARYANDJULYFORSELECTEDCITIES';
INPUTCITY$1-15JANUARYJULY;
CARDS;
MOBILE51.281.6
PHOENIX51.291.2
LITTLEROCK39.581.4
SACRAMENTO45.175.2
DENVER29.973.0
HARTFORD24.872.7
WILMINGTON32.075.8
WASHINGTONDC35.678.7
JACKSONVILLE54.681.0
MIAMI67.282.3
ATLANTA42.478.0
BOISE29.074.5
CHICAGO22.971.9
PEORIA23.875.1
INDIANAPOLIS27.975.0
DESMOINES19.475.1
WICHITA31.380.7
LOUISVILLE33.376.9
NEWORLEANS52.981.9
PORTLAND,MAINE21.568.0
BALTIMORE33.476.6
BOSTON29.273.3
DETROIT25.573.3
SAULTSTEMARIE14.263.8
DULUTH8.565.6
MINNEAPOLIS12.271.9
JACKSON47.181.7
KANSASCITY27.878.8
STLOUIS31.378.6
GREATFALLS20.569.3
OMAHA22.677.2
RENO31.969.3
CONCORD20.669.7
ATLANTICCITY32.775.1
ALBUQUERQUE35.278.7
ALBANY21.572.0
BUFFALO23.770.1
NEWYORK32.276.6
CHARLOTTE42.178.5
RALEIGH40.577.5
BISMARCK8.270.8
CINCINNATI31.175.6
CLEVELAND26.971.4
COLUMBUS28.473.6
OKLAHOMACITY36.881.5
PORTLAND,OREG38.167.1
PHILADELPHIA32.376.8
PITTSBURGH28.171.9
PROVIDENCE28.472.1
COLUMBIA45.481.2
SIOUXFALLS14.273.3
MEMPHIS40.579.6
NASHVILLE38.379.6
DALLAS44.884.8
ELPASO43.682.3
HOUSTON52.183.3
SALTLAKECITY28.076.7
BURLINGTON16.869.8
NORFOLK40.578.3
RICHMOND37.577.9
SPOKANE25.469.7
CHARLESTON,WV34.575.0
MILWAUKEE19.469.9
CHEYENNE26.669.1
;
PROCPLOT;
PLOTJULY*JANUARY=CITY/VPOS=36;
PROCPRINCOMPCOVOUT=PRIN;
VARJULYJANUARY;
PROCPLOT;
PLOTPRIN2*PRIN1=CITY/VPOS=26;
TITLE3'PLOTOFPRINCIPALCOMPONENTS';
Run;
例5.8美国大学生篮球队排名
databballm;
labelcsn='CommunitySportsNews(ChapelHillNC)'
dursun='DurhamSun'
durher='DurhamMorningHerald'
waspost='WashingtonPost'
usatoda='USAToday'
spormag='SportMagazine'
insport='InsideSports'
upi='UnitedPressInternational'
ap='AssociatedPress'
sporill='SportsIllustrated'
;
title1'Pre-Season1985CollegeBasketballRankings';
inputschool$sundurherwaspostusatoda
spormaginsportupiapsporill;
formatcsn--sporill5.1;
cards;
Louisville
GeorgiaTech2243111211
Kansas34515118457
Michigan4594253132
Duke56754104565
UNC6122342323
Syracuse71
NotreDame881312.
Kentucky91121113
LSU10913.1315169148
DePaul11.211520.19..19
Georgetown
Navy132.20.
Illinois6
Iowa1516..23..14.20
Arkansas16...25....16
MemphisState17.11.16820.1512
Washington18......17..
UAB191310.1217.161615
UNLV2018181922.141818.
NCState2117141615.12151718
Maryland22...19...1914
Pittsburg23.........
Oklahoma247.1317
Indiana2512201821.....
Virginia26.22..18....
OldDominion27.........
Auburn281011
29....14....
UCLA30......19..
's..19.......
Tennessee..24..16....
Montana...20......
Houston....24.....
VirginiaTech......13...
;
procprincompdata=bballn=1out=pcbballstandard;
varcsn--sporill;
weightweight;
procsortdata=pcbball;
byprin1;
procprint;
varschoolprin1;
title2'CollegeTeamsasOrderedbyPRINCOMP';
run;
例5.955个地区或国家的赛跑纪录如表5-7,试作主成分分析,并将55个国家或地区按赛
跑成绩分类。
表5-755个地区或国家的赛跑纪录
序号国家或
地区
100m
(秒)
200m
(秒)
400m
(秒)
800m
(分)
1500m
(分)
5000m
(分)
10000m
(分)
马拉松
(分)
1argentin10.3920.8146.841.813.7014.0429.36137.72
2australi10.3120.0644.841.743.5713.2827.66128.30
3austria10.4420.8146.821.793.6013.2627.72135.90
4belgium10.3420.6845.041.733.6013.2227.45129.95
5bermuda10.2820.5845.911.803.7514.6830.55146.62
6brazil10.2220.4345.211.733.6613.6228.62133.13
7burma10.6421.5248.301.803.8514.4530.28139.95
8canada10.1720.2245.681.763.6313.5528.09130.15
9chile10.3420.8046.201.793.7113.6129.30134.03
10china10.5121.0447.301.813.7313.9029.13133.53
11columbia10.4321.0546.101.823.7413.4927.88131.35
12cookis12.1823.2052.942.024.2416.7035.38164.70
13costa10.9421.9048.661.873.8414.0328.81136.58
14czech10.3520.6545.641.763.5813.4228.19134.32
15denmark10.5620.5245.891.783.6113.5028.11130.78
16domrep10.1420.6546.801.823.8214.9131.45154.12
17finland10.4320.6945.491.743.6113.2727.52130.87
18france10.1120.3845.281.733.5713.3427.97132.30
19gdr10.1220.3344.871.733.5613.1727.42129.92
20frg10.1620.3744.501.733.5313.2127.61132.23
21gbni10.1120.2144.931.703.5113.0127.51129.13
22greece10.2220.7146.561.783.6414.5928.45134.60
23guatemal10.9821.8248.401.893.8014.1630.11139.33
24hungary10.2620.6246.021.773.6213.4928.44132.58
25india10.6021.4245.731.763.7313.7728.81131.98
26indonesi10.5921.4947.801.843.9214.7330.79148.83
27ireland10.6120.9646.301.793.5613.3227.81132.35
28israel10.7121.0047.801.773.7213.6628.93137.55
29italy10.0119.7245.261.733.6013.2327.52131.08
30japan10.3420.8145.861.793.6413.4127.72128.63
31kenya10.4620.6644.921.733.5513.1027.38129.75
32korea10.3420.8946.901.793.7713.9629.23136.25
33dprkorea10.9121.9447.301.853.7714.1329.67130.87
34luxembou10.3520.7747.401.823.6713.6429.08141.27
35malaysia10.4020.9246.301.823.8014.6431.01154.10
36mauritiu11.1922.4547.701.883.8315.0631.77152.23
37mexico10.4221.3046.101.803.6513.4627.95129.20
38netherla10.5220.9545.101.743.6213.3627.61129.02
39nz10.5120.8846.101.743.5413.2127.70128.98
40norway10.5521.1646.711.763.6213.3427.69131.48
41png10.9621.7847.901.904.0114.7231.36148.22
42philippi10.7821.6446.241.813.8314.7430.64145.27
43poland10.1620.2445.361.763.6013.2927.89131.58
44portugal10.5321.1746.701.793.6213.1327.38128.65
45rumania10.4120.9845.871.763.6413.2527.67132.50
46singapor10.3821.2847.401.883.8915.1131.32157.77
47spain10.4220.7745.981.763.5513.3127.73131.57
48sweden10.2520.6145.631.773.6113.2927.94130.63
49switzerl10.3720.4645.781.783.5513.2227.91131.20
50taipei10.5921.2946.801.793.7714.0730.07139.27
51thailand10.3921.0947.911.833.8415.2332.56149.90
52turkey10.7121.4347.601.793.6713.5628.58131.50
53usa9.9319.7543.861.733.5313.2027.43128.22
54ussr10.0720.0044.601.753.5913.2027.53130.55
55wsamoa10.8221.8649.002.024.2416.2834.71161.83
可用下列SAS程序作主成分分析,并将第1,2主成分画散点图
datarunrecod;
inputcountry$x1-x8;
cards;
argentin10.3920.8146.841.813.7014.0429.36137.72
australi10.3120.0644.841.743.5713.2827.66128.30
austria10.4420.8146.821.793.6013.2627.72135.90
belgium10.3420.6845.041.733.6013.2227.45129.95
bermuda10.2820.5845.911.803.7514.6830.55146.62
brazil10.2220.4345.211.733.6613.6228.62133.13
burma10.6421.5248.301.803.8514.4530.28139.95
canada10.1720.2245.681.763.6313.5528.09130.15
chile10.3420.8046.201.793.7113.6129.30134.03
china10.5121.0447.301.813.7313.9029.13133.53
columbia10.4321.0546.101.823.7413.4927.88131.35
cookis12.1823.2052.942.024.2416.7035.38164.70
costa10.9421.9048.661.873.8414.0328.81136.58
czech10.3520.6545.641.763.5813.4228.19134.32
denmark10.5620.5245.891.783.6113.5028.11130.78
domrep10.1420.6546.801.823.8214.9131.45154.12
finland10.4320.6945.491.743.6113.2727.52130.87
france10.1120.3845.281.733.5713.3427.97132.30
gdr10.1220.3344.871.733.5613.1727.42129.92
frg10.1620.3744.501.733.5313.2127.61132.23
gbni10.1120.2144.931.703.5113.0127.51129.13
greece10.2220.7146.561.783.6414.5928.45134.60
guatemal10.9821.8248.401.893.8014.1630.11139.33
hungary10.2620.6246.021.773.6213.4928.44132.58
india10.6021.4245.731.763.7313.7728.81131.98
indonesi10.5921.4947.801.843.9214.7330.79148.83
ireland10.6120.9646.301.793.5613.3227.81132.35
israel10.7121.0047.801.773.7213.6628.93137.55
italy10.0119.7245.261.733.6013.2327.52131.08
japan10.3420.8145.861.793.6413.4127.72128.63
kenya10.4620.6644.921.733.5513.1027.38129.75
korea10.3420.8946.901.793.7713.9629.23136.25
dprkorea10.9121.9447.301.853.7714.1329.67130.87
luxembou10.3520.7747.401.823.6713.6429.08141.27
malaysia10.4020.9246.301.823.8014.6431.01154.10
mauritiu11.1922.4547.701.883.8315.0631.77152.23
mexico10.4221.3046.101.803.6513.4627.95129.20
netherla10.5220.9545.101.743.6213.3627.61129.02
nz10.5120.8846.101.743.5413.2127.70128.98
norway10.5521.1646.711.763.6213.3427.69131.48
png10.9621.7847.901.904.0114.7231.36148.22
philippi10.7821.6446.241.813.8314.7430.64145.27
poland10.1620.2445.361.763.6013.2927.89131.58
portugal10.5321.1746.701.793.6213.1327.38128.65
rumania10.4120.9845.871.763.6413.2527.67132.50
singapor10.3821.2847.401.883.8915.1131.32157.77
spain10.4220.7745.981.763.5513.3127.73131.57
sweden10.2520.6145.631.773.6113.2927.94130.63
switzerl10.3720.4645.781.783.5513.2227.91131.20
taipei10.5921.2946.801.793.7714.0730.07139.27
thailand10.3921.0947.911.833.8415.2332.56149.90
turkey10.7121.4347.601.793.6713.5628.58131.50
usa9.9319.7543.861.733.5313.2027.43128.22
ussr10.0720.0044.601.753.5913.2027.53130.55
wsamoa10.8221.8649.002.024.2416.2834.71161.83
procprincompout=w;
varx1-x8;
run;
由执行程序后得到的表可作如下分析。以下是特征值表
Eigenvalues
EigenvalueDifferenceProportionCumulative
PRIN16.622155.744530.8277680.82777
PRIN20.877620.718300.1097020.93747
PRIN30.159320.035270.0199150.95739
PRIN40.124050.044170.0155060.97289
PRIN50.079880.011920.0099850.98288
PRIN60.067970.021550.0084960.99137
PRIN70.046420.023820.0058020.99717
PRIN80.02260.0.0028251.00000
两个主成分的累计方差贡献率为0.9375,可见只要两个主成分就够了。以下是由特征向量表
Eigenvectors
PRIN1PRIN2PRIN3PRIN4PRIN5PRIN6PRIN7PRIN8
X10.3175560.5668780.3322620.1276280.262555-.5937040.1362410.105542
X20.3369790.4616260.360657-.259116-.1539570.656137-.112640-.096054
X30.3556450.248273-.5604670.652341-.2183230.156625-.002854-.000127
X40.3686840.012430-.532482-.4799990.540053-.014692-.238016-.038165
X50.372810-.139797-.153443-.404510-.487715-.1578430.6100110.139291
X60.364374-.3120300.1897640.029588-.253979-.141299-.5912990.546697
X70.366773-.3068600.1817520.080069-.133176-.219017-.176871-.796795
X80.341926-.4389630.2632090.2995120.4979280.3152850.3988220.158164
第1主成分个分量系数是大体相同的正数,第1主成分反映各国或地区跑步实力,它越小
速度快,实力越强。第2主成分在100米,200米,400米的系数为正,1500米,5000米,10000
米和马拉松系数为负,且距离越长,负的越多,反映短距离跑速度与长距离跑速度差异,即
长距离跑相对短距离跑的实力,它越小短距离相对长距离实力越强。
为了将第1主成分作为横轴,第2主成分作为立轴画出散点图,可用下列程序
procgplotdata=w;
plotprin2*prin1=country;
run;
得图5.1
图5.1径塞记录第1,2主成分散点图
COUNTRYargentinaustraliaustriabelgiumbermuda
brazilburmacanadachilechina
columbiacookiscostaczechdenmark
domrepdprkoreafinlandfrancefrg
gbnigdrgreeceguatemalhungary
indiaindonesiirelandisraelitaly
japankenyakorealuxemboumalaysia
mauritiumexiconetherlanorwaynz
philippipngpolandportugalrumania
singaporspainswedenswitzerltaipei
thailandturkeyusaussrwsamoa
PRIN2
-3
-2
-1
0
1
2
PRIN1
-4-3-2-111
从图5.1可见这些国家或地区可分为4类:第1类只有库克群岛,第1,2主成分都最大,说明
跑步速度慢,长距离跑相对短距离跑的实力大(短距离跑速度特别慢);第2类包括百慕大,
多米尼加,马来西亚,新加坡,泰国第1主成分较大,第2主成分为负值,说明总体来说,跑
步速度不快,短距离跑相对长距离跑的实力强,即短距离跑得快而长距离跑得慢;第3类包
括中国台北,朝鲜,哥斯达尼加,巴布亚新几内亚,危地马拉,毛里求斯,缅甸,菲律宾,
印度尼西亚,第1主成分较大,说明总体来说,跑步速度不快,第2主成分为正值,说明短距
离跑相对长距离跑的实力小(长距离跑速度较快);其他国家或地区为一类;第1主成分较
小,说明跑步速度总的较快,第2主成分为正值,说明短距离跑得快而长距离跑得慢。
5.4.2主成分回归
在回归分析中,常遇到自变量存在多重共线性问题,即自变量的观测值存在线性相关,
或近似线性相关。这时设计矩阵满足0'XX,第4章中已指出,用公式
YXXX')'(1
估计参数会造成较大方差。选取自变量的主成分,它们是彼此正交的。
用少量主成分作回归,再将主成分化为原始变量,这样得到的回归方程就不存在较大方差了。
与逐步回归相比,这样做的好处是,所有变量都被计算,而逐步回归略去一些自变量,这些
自变量不被计算。主成分回归原理见例5.10。
例5.10(主成分回归的原理)设自变量x1,x2,x3和因变量y的观测值如表5-8,以x1,x2,x3
为自变量y为因变量作回归。由于x3的值近似为x1与x2的和,存在多重共线性,直接以
x1,x2,x3为自变量不好,作主成分回归。
首先将x1,x2,x3零均值化,得到x1*,x2*,x3*
x1*=(x1-0.22)/0.49193,
x2*=(x2-0.24)/0.53666,
x3*=(x3-0.46)/0.50794
用相关阵计算主成分(计算所得表略去),主成分得分见表5-9,因为前一、两个主成分的
累计百分比分别为0.576和0.992,取前两个主成分为自变量(略去1个主成分)。由表5-9
数据作以y为因变量,Prin1和Prin2为自变量的回归方程
y=1.48+1.40128Prin1+0.31623Prin2
有主成分分析算得结果,Prin1,Prin2可用x1*,x2*,x3*表示:
prin1=0.461761x1*+0.461761x2*+0.757333x3*
prin2=0.707107x1*-0.707107x2*
通过变量替换,变成以x1,x2,x3为自变量的回归方程
y=-0.0598224+1.7699x1+0.789043x2+2.08929x3
这就是主成分回归所得方程。
表5-8变量观测值
X1x2x3y
1.1014
01.213
000.30.3
0000.1
0000
表5-9变量主成分得分(用协差阵计算)
YPrin1Prin2Prin3
4.00.551320.905250.01335
3.00.88824-0.687390.01118
0.3-0.33232-0.04123-0.13951
0.1-0.55362-0.088320.05749
0.0-0.55362-0.088320.05749
上述步骤很麻烦,可用SAS-REG过程实现:在procreg语句中,加选项outest=文件名(存
储主成分回归的输出数据集)在model语句自变量后加“/”号,再加选项pcomit=k(k为略去
的主成分个数)。执行后看最后一张表最后一行即可。
例5.111978-1993我国民航客运量y(万人),国民收入x1(亿元),消费额x2(亿元),
铁路客运量x3(万人),民航航线里程数x4(万公里),来华旅游人数x5(万人)如表5-10。
试建立以民航客运量为因变量的回归方程。
表5-10我国民航数据表
yearyx1x2x3x4x5
1978231314.89180.92
1979298335.00420.39
1989220419.53570.25
1981430021.82776.71
3.27792.43
335810604422.91947.70
39.021285.22
.721783.30
555210857932.432281.95
8.912690.23
388.383169.48
769.192450.14
0.681746.20
571.913335.65
83.663311.50
8296.084152.70
解直接建立回归方程发现,回归方程消费额x2的系数是负的,这与实际情况不符:消费额
越多,应当乘飞机人数越多;而回归方程x2的系数是负的,导出错误结论:消费额越多,则
乘飞机人数越少。通过检验发现,x2的系数是负的原因是:自变量出现多重共线性。本例中
自变量有5个,主成分也有5个。通过主成分分析,前3个主成分累计百分比为99.61,可以选
用前3个主成分,即去掉2个主成分,于是可用如下SAS程序计算
dataairpline;
doyear=1978to1993;
inputyx1-x5;
output;
end;
cards;
231314.89180.92
298335.00420.39
343368825319220419.53570.25
430021.82776.71
44542583.27792.43
39604422.91947.70
554565239.021285.22
7447027.721783.30
9977859555210857932.432281.95
13142938.912690.23
1442117388.383169.48
1283131769.192450.14
16671250.681746.20
2178165571.913335.65
288629383.663311.50
33832488296.084152.70
;
procreg;
modely=x1-x5/selection=stepwise;
modely=x1-x5/selection=stepwise;
run;
procregoutest=w;/*主成分回归计算结果存数据集wu*/
modely=x1-x5/pcomit=2;/*实行主成分分析,去掉2个主成分*/
procprintdata=w;
run;
执行后输出的最后一个表为
__I
_D_PN
M_ERC_T
OTPIORE
DYVDMMR
OEPAGISC
BLERETEEXXXXX
S______P12345
Y
1MODEL1PARMSY..75.9488622.3270.13433-0.15715-0.00973818.44350.29278-1
2MODEL1IPCY.296.1468702.5730.038780.05967-0.0106489.90570.21851-1
这表倒数第2行和最后一行分别是:不减少主成分和减少两个主成分所得参数估计。由这表
最后一行可见主成分方法得到的回归方程就是
y=702.573+0.03878x1+0.05967x2-.010648x3+9.9057x4+0.21851x5
其中消费额x2的系数就是正的了。
dataUSairpollution;
inputcity$SO2tempmanupopulwindprecippredays;
cards;
Albany4647.6441168.833.36135
Albuquerque1156.8462448.97.7758
Atlanta2461.53684979.148.34115
Baltimore4755.06259059.641.31111
Buffalo1147.139146312.436.11166
Charleston3155.235716.540.75148
Chicago11050.63344336910.434.44122
Cincinnati2354.04624537.139.04132
Cleveland6549.7100775110.934.99155
Columbus2651.52665408.637.01134
Dallas966.264184410.935.9478
Denver1751.94545159.012.9586
DesMoines1749.010420111.230.85103
Detroit3549.91064151310.130.96129
Hartford5649.14121589.043.37127
Houston1068.9721123310.848.19103
Indianapolis2852.33617469.738.74121
Jacksonville1468.41365298.854.47116
KansasCity1454.538150710.037.0099
LittleRock1361.0911328.248.52100
Louisville3055.62915938.343.11123
Memphis1061.63376249.249.10105
Miami1075.52073359.059.80128
Milwaukee1645.756971711.829.07123
Minneapolis2943.569974410.625.94137
Nashville1859.42754487.946.00119
NewOrleans968.32043618.456.77113
Norfolk3159.39630810.644.68116
Omaha1451.518134710.930.1898
Philadelphia6954.6169219509.639.93115
Phoenix1070.32135826.07.0536
Pittsburgh6150.43475209.436.22147
Providence9450.034317910.642.75125
Richmond2657.81972997.642.59115
SaltLakeCity2851.01371768.715.1789
SanFrancisco1256.74537168.720.6667
Seattle2951.13795319.438.79164
5655.97756229.535.89105
Washington2957.34347579.338.89111
Wichita856.612527712.730.5882
Wilmington3654.080809.040.25114
;
procreg;
modelSO2=tempmanupopulwindprecippredays/selection=stepwise;
run;
procregoutest=w;/*主成分回归计算结果存数据集wu*/
modelSO2=tempmanupopulwindprecippredays/pcomit=3;
procprintdata=w;
run;
Model:MODEL1
DependentVariable:SO2
NumberofObservationsRead41
NumberofObservationsUsed36
NumberofObservationswithMissingValues5
AnalysisofVariance
SumofMean
SourceDFSquaresSquareFValuePr>F
Model6137862297.6195010.46<.0001
Error296369.92191219.65248
CorrectedTotal3520156
RootMSE14.82068R-Square0.6840
DependentMean31.19444AdjR-Sq0.6186
CoeffVar47.51063
ParameterEstimates
ParameterStandard
VariableDFEstimateErrortValuePr>|t|
Intercept1133.6761650.055242.670.0123
temp1-1.529920.65951-2.320.0276
manu10.066920.017173.900.0005
popul1-0.042360.01649-2.570.0156
wind1-3.383461.88276-1.800.0827
precip10.804160.397602.020.0524
predays1-0.172060.17864-0.960.3434
5.5经验正交函数分解
经验正交函数分解也称为经验正交分解或自然正交分解,其算法类似主成分分析,但
含义不同,分析方法不同。主成分分析是对随机向量作分析,而经验正交函数分解是对确定
性变量进行分析。由于经验正交函数分解在气象上的用途较大,在气象上称为EOF方法,见
吴洪宝(2005)。
对气象要素场的分解,目前已有多种方法,如谐波分析、球函数分解、Chebixief分解。
与它们相比,经验正交函数分解更有其优越性。(1)没有固定的函数形式,因而不需许多
数学假设,更易符合实际。(2)能在不规则分布站点使用。(3)既可以在空间不同点进行
分解,也可在同一站点对不同时间点分解,也可对同一站点不同要素做分解。
为了便于叙述经验正交分解,我们举不同站点,不同时间气温的例5.12和例5.13作为例
子;容易看出气温可换为任何其他指标,站点这一属性变量也可换为别的属性变量,如例5.14
中换为害虫的品种。
例5.12设有p个站点,在i时刻j个站点温度观测值为
ij
t,温度矩阵为
npn
p
p
t......t
............
t......t
t......t
1
221
111
,令
n
1k
kjijij
t
n
1
tf,
即
ij
f是各站点观测值减去对时间的平均值(距平),
ij
f形成矩阵
npn
p
p
ff
ff
ff
F
......
............
......
......
1
221
111
。
设样本协差阵为R,则FF
1-n
1
RT,若
j
d为R的第j个彼此正交的单位特征向量,则
pijFdT
jj
,...,
是第j个主成分得分所成向量。用分块矩阵写成
]...[]...,[
11pp
ddFTT
由特征向量正交性,通过分块矩阵乘法易证
T
T
p
1
p1
d
...
d
T...TF,
于是,由分块矩阵乘法
T
j
j
d
p
1
j
TF。称
j
d为第j个空间函数,
j
d彼此正交。每一个
j
d是
一个空间正交
经验函数。和主成分分析一样,通常只取少数前k个
j
T和
j
d,
T
j
k
j
j
dTF
1
,即F可用少
数空间正交经验
函数描述。
我们也可以考虑时间正交经验函数,令
p
1k
ikijij
t
p
1
tg,设
ij
g形成矩阵
np
f......g
............
g......g
g......g
G
p1
2n21
1n11
。
设GG
1-p
1
HT,若
j
c为H的第j个彼此正交的单位特征向量,则
n,...,2,1GcLj
jj
,
是第j个主成分得分所成向量。用分块矩阵写成
]c...c[G]L...,L[
n1n1
由特征向量正交性,通过分块矩阵乘法易证
T
T
n
1
n1
c
...
c
L...LG,
于是,由分块矩阵乘法
T
j
j
j
cLGn
1
。称
j
c为第j个时间函数,
j
c彼此正交。每一个
j
c是
一个正交经验函数。和主成分分析一样,通常只取较小的数k,只考虑前k个
j
c和
j
L,
T
j
k
j
j
cLG
1
,
即G可用少数时间正交经验函数描述。
上述分解称为正交经验函数分解,也称为正交经验分解。
容易看出,正交经验函数分解与主成分分析的计算相同:若有q行p列数据矩阵F,将矩
阵q个行变量作为p维随机变量的q次观测值,用协方差阵做主成分分析。各个主成分的系数
构成p个正交经验函数,p个主成分得分就是正交经验函数的系数。
正交经验(函数)分解可以用SAS-PRINCOMP过程完成,例5.13说明其过程。
例5.13设某三气象站(a,b,c)五次气温观测如表4-11,试做经验正交函数分解。
表5-11三气象站五次气温观测值
abc
101214
91112
91213
101214
81113
用空间经验函数分解,可采用下列程序
datatemperat;
inputabc;
cards;
101214
91112
91213
101214
81113
;
procprincompcovout=w1;
varabc;
procprintdata=w1;
run;
执行该程序后,得到的主要输出为
SimpleStatistics
ABC
Mean9.2.6000000013.20000000
StD0.8366600270.547722560.83666003
上表指出三地气温平均值分别是9.2,11.6,13.2;因而距平F应是观测值减去此三数,即{
ij
f}
为
0.80.40.8
-0.2-0.6-1.2
-0.20.4-0.2
0.80.40.8
-1.2-0.6-0.2
Eigenvectors
PRIN1PRIN2PRIN3
A0.6425420.707107-.295194
B0.4174680.0000000.908692
C0.642542-.707107-.295194
上表给出三个特征向量,它们是彼此正交的经验函数(空间函数)。第一特征向量
)'0.642542,0.417468,0.642542(x
1
中各分量相差不多,反映三地气温偏高的特征;第
二特征向量)'-0.707107,0.00000,((707107x
2
中第1,3个分量绝对值相同,符号相反,
反映a地气温偏高,c地气温偏低的趋势。第三特征值很小,第三特征向量不考虑
OBSABCPRIN1PRIN2PRIN3
11012141.195050.00000-0.10883
291112-1.150040.70711-0.13194
391213-0.09003-0.000000.48155
41012141.195050.00000-0.10883
581113-1.15004-0.70711-0.13194
上表给出'19505.119505.1,09003.0,19505.1,19505.1(
1
),T,
)'70711.0,00000.0,00000.0,70711.0,00000.0(
2
T
说明第一特征在第1,4年较强(1、4年气温偏高),第2,5年反方向较强(2、5年气温偏低),
第三年很弱(气温接近平均值);第二特征第二年较强(a地气温距平比c地高),第五年反
方向较强(c地气温比a地高)。
若想得到时间正交经验分解,则须把5年的观测值作为5个随机变量,a、b、c,3地
代表不同观测次数,就可得另一分解。这时
采用程序
datatemperat;
inputyear1-year5;
cards;
1099108
1211121211
1412131413
;
procprincompcovout=w2;
varyear1-year5;
procprintdata=w2;
run;
执行程序后得到的主要结果是
Eigenvectors
PRIN1PRIN2PRIN3PRIN4PRIN5
YEAR10.436094-.465405-.0371970.3030460.707107
YEAR20.3342450.2403580.9113220.0000000.000000
YEAR30.4504420.713419-.3533700.4040610.000000
YEAR40.436094-.465405-.0371970.303046-.707107
YEAR50.5522920.007656-.204583-.8081220.000000
这5个特征向量就是彼此正交的时间经验函数。由上表可见第1特征向量是
)'552295.0,436094.0,450442.0,334245.0,436094.0(
1
X反映五年气温偏高特
征;第2特征向量是)'007656.0,465405.0,713419.0,240358.0,465405.0(
2
X,反映
第三年偏高,第1,4年偏低特征。
OBSYEAR1YEAR2YEAR3YEAR4YEAR5PRIN1PRIN2PRIN3PRIN4PRIN5
11099108-4.82526-0.22404000
2.595810.55828000
3.22945-0.33425000
上表表明)22945.4,59581.0,82546.4(L
1
,中-4.82546和4.22945绝对值相对很
大符号一正一负,0.59581绝对值相对偏小,说明第一特征在a,c地突出,a地5年温度偏
低,c地5年温度偏高;)'33425.0,55828.0,22404.0(L
2
。0.55828绝对值相对
偏大,这说明第二特征在b地突出,b地第三年偏高,第1,4年偏低。
例5.14已知山楂园昆虫群落调查表如表5-12,其中x1-x16表示不同时刻昆虫的群
落x1:桃蚜,x2:山楂木虱,x3:草履蚧,x4:山楂叶螨,x5:梨网蝽,x6:黑绒金龟子,x7:
苹毛金龟子,x8:顶梢叶蛾,x9:苹小卷叶蛾,x10:金纹细蛾,x11:舟形毛虫,x12:山楂粉
蝶,x13:桃小食心虫,x14:梨小食心虫,x15:白小食心虫,x16:桑天蛾。T表示时段。做经
验正交函数分解。
表5-12山楂园昆虫群落调查表
tx1x2x3x4x5x6x7x8x9x10x11x12x13x14x15x16
100090
224187000
33295313220050
46758622100340070
52665330
6382640
72130
81176
98
3726196
89815
127
93230
0
1500000
0000
解设变量x1-x16表示各种昆虫群落,程序为
datainsect;
inputtx1-x16;
cards;
100090
224187000
33295313220050
46758622100340070
52665330
6382640
72130
81176
98
3726196
89815
127
93230
0
1500000
160000
;
procprincompcovn=4out=eo;
varx1-x16;
run;
procprintdata=eo;
varprin1-prin4;
run;
执行后的主要输出为
Eigenvalues
EigenvalueDifferenceProportionCumulative
PRIN184013.743593.60.5386760.538676
PRIN240420.122524.40.2591640.797841
PRIN317895.713220.90.1147430.912584PRIN4
4674.9.0.0299740.942558
以上是协方差阵的前4个特阵值,累积比例0.9426表明只要取4个特阵向量-经验正交函数就
够了。
Eigenvectors
PRIN1PRIN2PRIN3PRIN4
X10.6157500.2680840.105590-.024590
X20.0757760.000319-.1925030.662614
X30.009459-.008403-.0304540.098534
X4-.0651800.1232610.0421560.037380
X5-.3690400.888579-.092851-.056854
X60.3798580.1634050.080058-.139139
X70.5317130.2497750.0989180.059858
X80.0098340.006608-.0461470.139279
X9-.0065960.005590-.0248130.099052
X10-.1200400.0658800.218413-.357796
X11-.0863550.0342700.5243090.237510
X120.0392510.005635-.0181960.074936
X13-.107747-.0445230.7659760.047768
X14-.1086120.1557970.0985430.531365
X15-.0239710.046099-.046134-.138075
X16-.0065500.0142580.0059160.010357
以上是前4个特征向量,即4个主要正交经验函数。第1特征向量中x1、x7系数大,x5负系数
大,表明桃蚜和苹毛金龟子高发而梨网蝽低发;第2特征向量中x5系数大,表明梨网蝽高发,
第3征向量中x11,x13系数大,表明舟形毛虫和桃小食心虫高发;第4特征向量中x2,x14系数
大,表明山楂木虱,梨小食心虫高发。
OBSPRIN1PRIN2PRIN3PRIN4
1-31.461-188.305-73.085-29.684
2-8.182-179.016-70.590-23.294
3369.967-4.305-8.985-31.552
4859.690254.04973.040-13.554
5279.934-0.902-44.50851.884
6-3.851-75.585-112.096114.183
7-79.377-59.232-104.294112.322
8-129.83622.840-79.17344.987
9-203.233184.630-87.230-95.750
10-304.774407.810-62.373-98.047
11-306.690370.82843.951100.632
12-176.478-25.204346.23128.616
13-121.268-154.751278.666-18.080
14-68.164-175.74233.614-58.972
15-40.817-186.957-61.825-49.482
16-35.460-190.160-71.343-34.209
上表为主成分得分。Prin1得分中3,4,5时段得分数大,表明该三个时段桃蚜和苹毛金龟
子高发高发;Prin2得分中4,9,10,11时段得分数大,表明上述时段梨网蝽高发(与实
际情况略有差异,因有第1主成分影响);Prin3得分中第12,13时段得分大,表明这2
时段舟形毛虫和桃小食心虫高发Prin4得分中6,7,10时短得分大,表明这三时段山楂木
虱,梨小食心虫高发。
应用实例
例1利用武汉市2005年五月份的每天的各监测站平均的SO2监测值与每天早上八点钟的风
力、气温、三小时降水作主分量分析。
解:采用以下程序
datatemperat;
inputdateemo_so2dd_08t_08r_08;
cards;
2005050122.43123.0018.00
2005050245.57220.400.00
2005050370.14122.300.00
2005050447.14222.907.00
2005050542.29123.000.10
2005050634.57117.800.00
2005050746.14120.000.00
2005050827.86121.600.00
2005050945.57121.900.00
2005051053.57124.200.00
2005051143.29224.100.00
2005051254.71225.200.00
2005051349.29023.800.70
2005051426.57220.700.00
2005051520.71220.000.00
2005051626.14122.0013.00
2005051718.29220.3018.00
2005051830.29220.200.00
2005051933.00220.400.20
2005052046.14020.900.00
2005052124.14119.205.00
2005052226.14121.100.00
2005052342.71023.300.30
2005052445.00022.300.00
2005052548.29224.100.00
2005052652.86223.800.00
2005052752.29024.300.00
2005052833.43125.300.00
2005052944.14224.300.00
2005053054.86227.600.00
2005053130.71125.900.00
;
procprincompcovout=prin;
vardd_08t_08r_08;
procprintdata=prin;
run;
执行上述程序,输出的结果是:
EigenvaluesoftheCovarianceMatrix
EigenvalueDifferenceProportionCumulative
125.667416820.93505960.82990.8299
24.73235734.20413130.15300.9829
30.52822590.01711.0000
Eigenvectors
Prin1Prin2Prin3
dd_080.0107590.0033340.999937
t_08-.0747310.997201-.002521
r_080.9971460.074699-.010978
上表给出所考察变量样本协差阵的特征值25.6674168、4.7323573、0.5282259和特征向
量(0.010759,-0.074731,0.997146)、(0.003334,0.997201,0.074699)和(0.999937,
-0.002521,-0.010978)。因此第一、二、三主分量分别是
prin1=0.010759*dd_08-0.074731*t_08+0.997146*r_08,
prin2=0.003334*dd_08+0.997201*t_08+0.074699*r_08,
prin3=0.999937*dd_08-0.002521*t_08-0.010978*r_08
由于第1个主分量累计贡献为0.8299,前两个第2个主分量所占比例为0.9829,所以对
三个主分量只需要考虑前两个。
例2
资料:江苏省沿江七市1954—2002年梅雨量。资料说明:NJ-南京,SZ-苏州,WX-无锡,CZ-
常州,ZJ-镇江,YZ-扬州,NT-南通。所用资料省略了小数位,6234表示623.4毫米。
方法:主分量分析。
程序主体如下:
DATAMEIYU;
TITLE'MEIYUINSELECTEDCITIES';
INPUTYEARNJSZWXCZZJYZNT;
CARDS;
827993300
1955212
4254502967
19572611
2628
746969
529931
1231017
933872341
115481470
5881203
423791474988894
61961602
312672287
327061756
19696796
19765934
114111019
1972448
7913851347
530342718
239364951
52533289
3227
415
26492619
1983341469
717932513
134283287
822893374299322082520
723772569
6257822682
024084870
435463905
966588251199511
524381403
18211553
567778339821777047349
62739
525231937
1887
429582493222314282919
447803060
006884
921731501
40
251552
2399272272242334
228222236873821905
;
procprincompcovoutstat=prinout=temp1;
VARNJSZWXCZZJYZNT;
procprintdata=temp1;
run;
结果分析:
第一主分量解释了总方差的80.7%,第二主分量解释了12.7%,二者共解释了总方差的
93.3%。故只需分析第一和第二主分量就可概括江苏沿江地区梅雨的时空演变情况。
从第一主分量可以看出,第一特点是量值均为正数,表明江苏沿江地区梅雨量异常具有
同步性,第二特点是宁、镇、扬地区为高值区,苏、锡、常、通为低值区。结合第一主分量
时间系数曲线(图1)中存在的年际振荡的趋势,江苏沿江地区梅雨似乎存在一个振幅周期
约3~4年的振荡。图中黑粗线为三阶多项式回归趋势线。可以看出,上世纪50年代末至
70年代初为梅雨量较少的时期,70年代中后期至90年代初梅雨量偏多,1993年以来,梅
雨量偏少,1999年以后似乎有回升的趋势。由于资料只到2002年,但我们知道2003年为
我省历史罕见的梅雨量异常多年,因此也印证了这种变化趋势。
从第二主分量可以看出,宁、镇、扬地区为负值,苏、锡、常、通地区正值,表明这两
个地区梅雨季节旱涝异常呈反位相。即当宁、镇、扬地区偏旱时,苏、锡、常、通地区偏涝;
当宁、镇、扬地区偏涝时,苏、锡、常、通地区偏旱。也就是东旱西涝或东涝西旱。结合第
二主分量时间系数曲线(图2)可以看出上世纪80年代末以前,以负系数为主(即东旱西涝),
而90年代至今以正系数为主(既东涝西旱)。
综上可将江苏省沿江地区梅雨降水分成四种型:旱型、涝型、东旱西涝型、东涝西旱型。
例3全国14个城市的春夏秋冬四季的平均季雷暴日如下表,试用主分量分析方法分析。
SpringSummerAutumnWinter
Shanghai6.919.75.40.3
Ganyu5.129.73.40.1
Xuzhou5.526.12.90.1
Sheyang5.727.85.30.2
Qingjiang6.029.54.10.4
Xuyi6.333.04.10.2
Dongtai5.429.54.80.4
Gaoyou5.730.15.80.4
Qidong6.423.45.80.2
Nantong6.425.75.30.3
Nanjing7.425.93.20.4
Wujin7.726.64.20.3
Liyang9.131.75.40.5
Wuxian10.029.67.20.4
Sas程序如下:
datatemperat;
inputcitySpringSummerAutumnWinter;
cards;
Shanghai6.919.75.40.3
Ganyu5.129.73.40.1
-10000
-5000
0
5000
10000
15000
20000
4941999
图1江苏省沿江地区梅雨降水第一主分量的时间系数
-5000
-4000
-3000
-2000
-1000
0
1000
2000
3000
4000
4941999
图2江苏省沿江地区梅雨降水第二主分量的时间系数
Xuzhou5.526.12.90.1
Sheyang5.727.85.30.2
Qingjiang6.029.54.10.4
Xuyi6.333.04.10.2
Dongtai5.429.54.80.4
Gaoyou5.730.15.80.4
Qidong6.423.45.80.2
Nantong6.425.75.30.3
Nanjing7.425.93.20.4
Wujin7.726.64.20.3
Liyang9.131.75.40.5
Wuxian10.029.67.20.4
;
procprincompdata=temperatcovoutstat=prin;
varSpringSummerAutumnWinter;
procprintdata=prin;
run;
运行结果如下:
EigenvaluesoftheCovarianceMatrix
EigenvalueDifferenceProportionCumulative
112.12309309.47492280.77720.7772
22.64817021.83097510.16980.9470
30.81719510.80774130.05240.9994
40.00945380.00061.0000
Eigenvectors
Prin1Prin2Prin3Prin4
Spring0.0320460.818308-.572537-.039334
Summer0.999396-.0203880.027273-.006921
Autumn-.0108230.5727200.819409-.021051
Winter0.0079570.044148-.0050870.998980
结果分析:
上表给出协差阵的四个特征值分别为:
λ1=12.1230930
λ2=2.6481702
λ3=0.8171951
λ4=0.0094538
对应得特征向量分别为:
(0.032046,0.999396,-.010823,0.007957)
(0.818308,-.020388,0.572720,0.044148)
(-.572537,0.027273,0.819409,-.005087)
(-.039334,-.006921,-.021051,0.998980)
由此可得:
Prin1=0.032046Spring+0.818308Summer-.572537Autumn-.039334Winter
Prin2=0.999396Spring-.020388+Summer+0.027273Autumn-.006921Winter
Prin3=-.010823Spring+0.572720Summer+0.819409Autumn-.021051Winter
Prin4=0.007957Spring+0.044148Summer-.005087Autumn+0.998980Winter
Spring和Summer两个主分量解释了总方差的94.7%,可见只要用Spring和Summer两个主
分量就能很好的概括这组数据。
例4对徐州,丰县和沛县三个地区从1961年到2000年逐年降水资料进行主分量分析。单
位:mm
程序如下:
datarain;
inputyearc1c2c3;
cards;
1961727.8586.8591.5
19621013.11285.41049.6
19631213.4855.91367.4
1964969.51103.31109.9
19651633.91158.61144.5
1966611.7462.7570.7
1967756.3842.7729.9
1968638.9486.7550.0
1969969.6779.6744.1
1970793.6792.9721.7
1971887.6801.31178.9
1972873.0673.0671.4
1973808.5899.7865.9
19741021.3848.2897.9
19751065.1639.21450.0
1976784.81335.4630.2
1977695.3747.4742.8
19782123.21610.3628.0
1979946.4922.3959.2
1980776.51260.0646.9
1981612.2418.6492.4
19821017.3558.9546.3
1983950.3563.7556.5
1984952.3818.1657.3
1985966.61098.3992.1
1986681.6431.2630.7
1987620.5694.5686.6
1988500.6352.0425.9
1989692.7541.6632.6
19901088.8803.1833.4
1991781.9664.4649.9
1992888.6875.5899.2
1993836.4800.2745.8
1994821.11533.9609.9
1995825.3651.3719.7
1996965.8673.0659.7
1997794.3427.9591.0
19981128.9725.6819.9
1999641.9462.4571.6
2000979.6630.4865.4
;
procprincompdata=raincovoutstat=prin;
varc1c2c3;
procprintdata=prin;
例5回归分析
一资料选取及分析方法简介
资料选取的是1951---1980年(计29年)的气象资料。该回归方程的目的是研究长江
中下游六月份降水量Y与六个预报因子X1---X6之间的相关关系。其中,X1:12月份北半
球极涡强度。X2:9月份40°N,90°W、100°W两点500hPa高度平均。X3:10月份65°
N,150°E-170°W五点500hPa高度平均。X4:12月份180°E-160°W三点500hPa高度平
均。X5:2月份80°N、150°W-130°W;70°N、150°W-130°W六点500hPa高度平均。X6:
9月份月平均温度减20度。以上因子X1—X6中凡是高度平均均以减去5000位势米。(注:
该资料中的数据已标准化)
分析方法是正交组合的多元回归。该方法是将M个因子正交组合成M个新的因子(即相
互独立的主成份),然后利用正交组合因子建立多元回归方程。具体做法:用SAS的PROC
PRINCOMP过程来分析组合因子,这里组合因子为主分量分析中的主成份,用主成份来建立
回归的好处:一、由于组合因子彼此正交,建立回归方程时特别简单,回归系数可以独立计
算。而且由于信息的重新组合分配,使预报因子减少成为可能。二、因为组合因子和预报对
象的相关系数与原变量与预报对象的相关系数有很大的不同。三、组合因子能利用到各个原
始变量的信息,这与逐步回归有很大不同。把PROCPRINCOMP得到的主成份与预报对象组成
新的数据来用正交回归PROCORTHOREG过程建立回归方程,同时根据组合因子与预报对象Y
的相关系数来对预报因子进行取舍。
二程序
(一)组合因子的产生-----PROCPRINCOMP过程
1程序PROCPRINCOMP
DATAchang;
TITLE'长江中下游6月份降水量的Y与X1,X2,X3,X4,X5,X6的关系之组合因子分析';
INPUTx1x2x3x4x5x6;
cards;
-0.362-0.684-0.6721.3050.529-0.548
1.8341.0042.328-1.685-0.4400.284
-2.0690.1601.256-1.570-2.217-0.363
-1.0941.8482.1130.385-0.1861.208
0.8581.004-0.2441.880-1.5712.040
-0.1180.582-1.1010.2700.529-0.271
0.126-1.106-0.030-0.4201.337-1.842
1.3460.582-1.101-0.075-1.086-0.641
-0.362-1.1061.256-0.6502.3060.191
-0.6061.004-0.244-0.535-1.0860.746
0.126-1.1060.1850.6151.3370.284
0.370-1.106-0.0300.6150.8520.284
-0.3622.270-0.672-0.880-1.2470.653
0.6140.160-0.4581.5350.3681.670
0.370-1.106-0.8870.960-0.117-0.918
0.126-1.106-0.887-0.190-0.440-1.295
2.078-0.684-1.5300.615-0.2780.099
-0.118-1.528-0.2441.075-0.1170.469
1.3460.1600.613-1.6851.6601.116
0.6140.1600.1850.500-0.601-0.271
-1.582-0.262-1.3151.075-0.117-0.548
-0.3620.160-0.6720.6150.045-1.288
-0.3620.160-0.458-1.110-0.601-1.658
-1.094-1.1061.042-1.5700.206-0.179
-1.826-1.106-1.315-0.650-0.2782.040
1.346-0.2620.613-1.340-0.117-1.011
0.1260.5821.2560.6151.014-0.086
-0.3621.4260.1850.1550.2060.653
-0.6061.0040.8280.1551.014-0.918
;
PROCPRINCOMPdata=changoutstat=prin;
varx1x2x3x4x5x6;
procprintdata=prin;
run;
2PROCPRINCOMP过程产生的新的组合因子
-1.4925-0.6176-0.29760.37460.68700.153979.5
2.18921.84331.6778-1.1641-0.37730.6343293.7
2.29390.7754-2.49810.2298-0.45341.0176389.2
2.9734-0.36860.19471.10601.04230.7352298.9
1.1684-2.84551.37770.0285-0.0918-0.7724274.3
-0.4992-0.5474-0.1514-.21400.4159-1.0957172.7
-1.78371.6799-0.4490-0.40520.4572-.138055.5
0.0030-0.83120.0536-2.0415-0.2565-0.1761221.8
-0.85532.14600.75101.6607-0.0109-0.2100211.9
1.5047-0.7266-0.56840.0152-0.4221-0.4042191.9
-1.32590.37840.75881.00340.02420.1837156.5
-1.23780.07010.69340.6076-0.20170.343086.3
2.2302-1.0085-.5056-0.7373-.1537-1.2968305.5
-0.2875-1.66291.54550.8072-0.0486-0.0495137.4
-1.6780-0.5551-0.3623-0.56750.11800.5770137.5
-1.22630.1116-1.0241-0.9073-0.43900.3318130.4
-1.4191-1.16841.0846-1.4798-0.91610.078771.6
-1.1812-0.72490.02850.7725-0.45921.0031139.4
0.33571.63121.94450.1894-1.0951-1.1240197.2
0.1120-0.37020.2667-0.66900.29330.6062218.0
-0.9717-1.1936-1.59040.60610.6363-0.3143182.1
-0.7718-0.2807-0.8398-0.53740.9663-0.2113159.8
0.03240.7587-1.5532-1.26580.0750-0.3341155.3
0.25751.8429-0.99760.8823-0.80080.3803268.4
-0.0028-1.0067-1.15701.9727-2.0726-0.6394243.3
0.05481.55140.3334-1.5011-0.46950.3055143.4
0.21390.63850.97570.55811.26990.1265115.1
1.0972-0.42690.37750.42170.6268-0.7700238.2
0.27260.9013-0.07230.28441.6478-0.5158219.1
(二)组合因子建立回归方程----PROCORTHOREG
datapreci;
inputz1-z6y;
cards;
-1.4925-0.6176-0.29760.37460.68700.153979.5
2.18921.84331.6778-1.1641-0.37730.6343293.7
2.29390.7754-2.49810.2298-0.45341.0176389.2
2.9734-0.36860.19471.10601.04230.7352298.9
1.1684-2.84551.37770.0285-0.0918-0.7724274.3
-0.4992-0.5474-0.1514-.21400.4159-1.0957172.7
-1.78371.6799-0.4490-0.40520.4572-.138055.5
0.0030-0.83120.0536-2.0415-0.2565-0.1761221.8
-0.85532.14600.75101.6607-0.0109-0.2100211.9
1.5047-0.7266-0.56840.0152-0.4221-0.4042191.9
-1.32590.37840.75881.00340.02420.1837156.5
-1.23780.07010.69340.6076-0.20170.343086.3
2.2302-1.0085-.5056-0.7373-.1537-1.2968305.5
-0.2875-1.66291.54550.8072-0.0486-0.0495137.4
-1.6780-0.5551-0.3623-0.56750.11800.5770137.5
-1.22630.1116-1.0241-0.9073-0.43900.3318130.4
-1.4191-1.16841.0846-1.4798-0.91610.078771.6
-1.1812-0.72490.02850.7725-0.45921.0031139.4
0.33571.63121.94450.1894-1.0951-1.1240197.2
0.1120-0.37020.2667-0.66900.29330.6062218.0
-0.9717-1.1936-1.59040.60610.6363-0.3143182.1
-0.7718-0.2807-0.8398-0.53740.9663-0.2113159.8
0.03240.7587-1.5532-1.26580.0750-0.3341155.3
0.25751.8429-0.99760.8823-0.80080.3803268.4
-0.0028-1.0067-1.15701.9727-2.0726-0.6394243.3
0.05481.55140.3334-1.5011-0.46950.3055143.4
0.21390.63850.97570.55811.26990.1265115.1
1.0972-0.42690.37750.42170.6268-0.7700238.2
0.27260.9013-0.07230.28441.6478-0.5158219.1
;
procorthoregdata=preci;
modely=z1-z6;
run;
三回归结果及分析
(一)PROCPRINCOMP过程结果及分析:
1结果
长江中下游6月份降水量的Y与X1,X2,X3,X4,X5,X6的关系之组合因子分析13
20:41Sunday,May3,1998
PrincipalComponentAnalysis
29Observations
6Variables
SimpleStatistics
X1X2X3
Mean-0.000172414-0..000000000
StD1...017603979
X4X5X6
Mean0..-0.003448276
StD1..99704825321.022061348
长江中下游6月份降水量的Y与X1,X2,X3,X4,X5,X6的关系之组合因子分析14
20:41Sunday,May3,1998
PrincipalComponentAnalysis
CorrelationMatrix
X1X2X3X4X5X6
X11.00000.0118-.02700.06360.1135-.0144
X20.01181.00000.2637-.0867-.37340.2639
X3-.02700.26371.0000-.41870.18880.1039
X40.0636-.0867-.41871.00000.05680.1658
X50.1135-.37340.18880.05681.0000-.0929
X6-.01440.26390.10390.1658-.09291.0000
长江中下游6月份降水量的Y与X1,X2,X3,X4,X5,X6的关系之组合因子分析15
20:41Sunday,May3,1998
PrincipalComponentAnalysis
EigenvaluesoftheCorrelationMatrix
EigenvalueDifferenceProportionCumulative
PRIN11.600370.1915480.2667280.26673
PRIN21.408820.3076110.2348040.50153
PRIN31.101210.1480790.1835350.68507
PRIN40.953130.3736490.1588550.84392
PRIN50.579480.2225020.0965810.94050
PRIN60.35698.0.0594971.00000
长江中下游6月份降水量的Y与X1,X2,X3,X4,X5,X6的关系之组合因子分析16
20:41Sunday,May3,1998
PrincipalComponentAnalysis
Eigenvectors
PRIN1PRIN2PRIN3PRIN4PRIN5PRIN6
X1-.122551-.0313110.585940-.779020-.1643880.082339
X20.6127440.2785840.101450-.1933060.538883-.457036
X30.500793-.4925720.2625760.1588300.2080340.607560
X4-.3991540.5260260.3024650.1990080.5286080.391735
X5-.330872-.5047860.4700050.3226400.236233-.504883
X60.2998870.3837580.5148270.432242-.551426-.082743
长江中下游6月份降水量的Y与X1,X2,X3,X4,X5,X6的关系之组合因子分析17
20:41Sunday,May3,1998
OBS_TYPE__NAME_X1X2X3X4X5X6
1MEAN-0.0002-0.00010.00000.00030.0312-0.0034
2STD1.01781.01761.01761.01760.99701.0221
3N29.000029.000029.000029.000029.000029.0000
4CORRX11.00000.0118-0.02700.06360.1135-0.0144
5CORRX20.01181.00000.2637-0.0867-0.37340.2639
6CORRX3-0.02700.26371.0000-0.41870.18880.1039
7CORRX40.0636-0.0867-0.41871.00000.05680.1658
8CORRX50.1135-0.37340.18880.05681.0000-0.0929
9CORRX6-0.01440.26390.10390.1658-0.09291.0000
10EIGENVAL1.60041.40881.10120.95310.57950.3570
11SCOREPRIN1-0.12260.61270.5008-0.3992-0.33090.2999
12SCOREPRIN2-0.03130.2786-0.49260.5260-0.50480.3838
13SCOREPRIN30.58590.10150.26260.30250.47000.5148
14SCOREPRIN4-0.7790-0.19330.15880.19900.32260.4322
15SCOREPRIN5-0.16440.53890.20800.52860.2362-0.5514
16SCOREPRIN60.0823-0.45700.60760.3917-0.5049-0.0827
2结果分析
在表EIGENVALUESOFTHECOVARIANCEMATRIX中给出协方差阵的六个特征值分别为:
λ1=1.6004,λ2=1.4088,λ3=1.1012,λ4=0.9531,λ5=0.5795,λ6=0.3570,
对应的特征向量分别为:
t1=(-0.1226,0.6127,0.5008,-0.3992,-0.3309,0.2999)
t2=(-0.0313,0.2786,-0.4926,0.5260,-0.5048,0.3838)
t3=(0.5859,0.1015,0.2626,0.3025,0.4700,0.5148)
t4=(-0.7790,-0.1933,0.1588,0.1990,0.3226,0.4322)
t5=(-0.1644,0.5389,0.2080,0.5286,0.2362,-0.5514)
t6=(0.0823,-0.4570,0.6076,0.3917,-0.5049,-0.0827)
由此可以得到第一主分量
PRIN1=-0.1226X1+0.6127X2+0.5008X3-0.3992X4-0.3309X5+0.2999X6
第二、三个等主分量类似可得。再将观测值代入主分量即得到新的组合因子(组合因
子即主成份)。
(二)PROCORTHOREG过程结果:
1结果
TheSASSystem21:49Sunday,May3,19984
ORTHOREGRegressionProcedure
DependentVariableY
SumofSquaredErrors38506.064031
DegreesofFreedom22
MeanSquaredError1750.2756378
RootMeanSqrError41.836295699
R-square0.7803532718
VariableDFParameterEstimateStdErrorT-RatioProb>|t|
INTERCEP1189.4.8.280.0001
z1150.68782492178346..370.0001
z211.374.7049073770.200.8395
z31-14.687.5323693735-1.950.0641
z4111.219.1.380.1802
z51-13.798983212231910.340164965-1.330.1957
z61-0.27313.242646135-0.020.9871
2SAS结果分析
由上述程序的结果我们可以看出正交回归统计分析ORTHOREG过程给出两个表:方差
分析表、参数估计表。
方差分析表:残差平方和SE=38506.064031也给出自由度为22,则平均平方
和MSE=SE/22=1750.2756378。表中还给出复相关系数平方R2=0.7803532718。
参数估计表:它给出常数项和回归系数分别是189.4
50.68782492178341.374-14.6811.219
-13.7989832122319-0.273;它们的标准误差是7.8.
6.7049073777.53236937358.1.34016496513.242646135。t值为:
24.288.370.20-1.951.38-1.33-0.02并给出t分布随机变量大于它们的概率
是0.00010.00010.83950.06410.18020.19570.9871。前两个概率小于0.05,
说明截距和Z1的作用明显,不能舍去。
由参数估计表的参数估计值结果可以得到回归方程:
Y=189.41+50.69Z1+1.38Z2-14.69Z3+11.22Z4-13.80Z5-0.22Z6
3物理意义分析
原预报因子与预报对象的相关系数Riy为-.36410.54920.4458-.4197-.5036
0.3761(Riy的求法:将原预报因子X1-X6与预报对象Y建立一个数据集从而可用PROC
PRINCOMP过程来计算,该过程的其中一个结果为各个因子之间的相关系数,这样可从中提
取出X1-X6与Y的相关系数。)可见,预报因子与预报对象的关系为:X1:12月份北半球
极涡强度,X4:12月份180°E-160°W三点500hPa高度平均及X5:2月份80°N、150°W-130°
W;70°N、150°W-130°W六点500hPa高度平均这三个因子与预报对象Y是反相关。而X2:
9月份40°N,90°W、100°W两点500hPa高度平均,X3:10月份65°N,150°E-170°W
五点500hPa高度平均,X6:9月份月平均温度减20度这三个因子与预报对象正相关。组合
因子与预报对象的相关系数R*
iy为0.83940.0203-.19610.1393-.1332-.0422
(R*
iy的求法与Riy的求法相同,不同的是原预报因子变为组合因子Z1-Z6来建立数据集)组
合因子Z1、Z2、Z4与预报因子呈正相关,组合因子Z3、Z5及Z6与预报对象呈负相关。(注:
由于篇幅有限这里不打印出这两个相应的相关矩阵)。
由上述的结果可以看出:原预报因子与预报对象的关系都非常密切,相关系数相差不
多,这是因为原预报因子之间有正交、彼此不相互独立。而组合因子与预报对象的关系则大
不相同,第一个组合因子(即第一主分量)明显与预报因子的相关系数达到0.839,所以用
组合因子,只要选取一个组合因子就可以预报六月份长江中下游的降水。可见用组合因子建
立回归方程,预报方程的显著性大大提高了。
根据参数估计的t检验及气象中应用的相关系数R*
iy可以看出组合因子中的Z1对预报
方程的贡献最大,故方程可简化为:
Y=189.41+50.69Z1
根据预报对象Y及预报因子Z1画图,则图象如下:
如上图,Y与Z1近似呈线性且斜率为正值。故可以看出用一个组合因子Z1就可以很好
的建立这个线性回归方程。
四总结
回归分析是气象研究和预报中经常使用的相关方法。我们可以用回归方法来建立相关预
报方程(降水、气温、体感温度、湿度等预报方程)。方法可采用逐步回归法、选择最优的
变量子集法、正交组合的多元回归等,本文采用正交组合的多元回归。该方法简化了方程、
提高了预报的显著性。
本文采用了1951---1980年(计29年)的气象资料(五个高度长资料一个温度长资料)
及正交组合的多元回归方法调用SAS中的PROCPRINCOMP和PROCORTHOREG来建立相应的
回归方程。从而经分析简化最终得到方程:
Y=189.41+50.69Z1,
其中,Z1用原始变量表示为:Z1=-0.1226X1+0.6127X2+0.5008X3-0.3992X4-0.33
09X5+0.2999X6