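This article covers the data-processing part of the project; please read it carefully.
1. Data Analysis
The crawl left us with text files whose names begin with a few fixed strings (the listings below use correct, errTime, unkownsex, and httperror), and these now need to be processed.
2. Rollback
We need to reprocess the httperror data.
Because of the way the code was written (see part 2 of this series), the same id can show up several times in a row as httperror records within one file: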
    // httperror265001_266001.txt
    265002 httperror
    265002 httperror
    265002 httperror
    265002 httperror
    265003 httperror
    265003 httperror
    265003 httperror
    265003 httperror
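The code therefore has to allow for this case: instead of processing the id on every line, it first checks whether the id repeats the previous one.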
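The duplicate check can be sketched on its own. This is a minimal illustration written for Python 3 (the listings in this article are Python 2.7); the helper name `unique_consecutive_ids` and the sample lines are assumptions made for the example, and it uses `itertools.groupby` rather than the line comparison used in the article's code.

```python
from itertools import groupby


def unique_consecutive_ids(lines):
    """Yield each id once per run of consecutive identical ids."""
    # The id is the first whitespace-separated field of every non-blank line.
    ids = (line.split()[0] for line in lines if line.strip())
    # groupby collapses each run of equal neighbouring keys into one group.
    for user_id, _run in groupby(ids):
        yield int(user_id)


# Sample records in the httperror format shown above.
sample = [
    "265002 httperror\n",
    "265002 httperror\n",
    "265003 httperror\n",
    "265003 httperror\n",
]
print(list(unique_consecutive_ids(sample)))  # [265002, 265003]
```

Comparing ids instead of whole lines also sidesteps the case where the last line of a file has no trailing newline.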
Java has buffered readers that avoid going to the disk for every read; Python has an equivalent too, covered in a separate article.
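As a rough illustration of buffered reading in Python, the sketch below passes an explicit `buffering` size to the built-in `open()`. It is written for Python 3; the function name `read_ids`, the 64 KiB buffer size, and the temporary demo file are assumptions for the example, not values taken from the crawler.

```python
import os
import tempfile


def read_ids(path, buffer_size=64 * 1024):
    """Read the first field of each line through a buffered file object."""
    ids = []
    # buffering=N requests an N-byte buffer on the underlying stream, so
    # iterating line by line does not translate into one disk read per line.
    with open(path, "r", encoding="utf-8", buffering=buffer_size) as handle:
        for line in handle:
            if line.strip():
                ids.append(line.split()[0])
    return ids


# Demo on a throwaway file (not one of the article's data files).
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("265002 httperror\n265003 httperror\n")
demo_ids = read_ids(tmp.name)
os.remove(tmp.name)
print(demo_ids)  # ['265002', '265003']
```

File objects returned by `open()` are buffered by default, so the explicit size mainly matters when tuning for very large files.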
    def main():
        reload(sys)
        sys.setdefaultencoding('utf-8')
        global sexRe, timeRe, notexistRe, url1, url2, file1, file2, file3, file4, startNum, endNum, file5
        sexRe = re.compile(u'em>\u6027\u522b</em>(.*?)</li')
        timeRe = re.compile(u'em>\u4e0a\u6b21\u6d3b\u52a8\u65f6\u95f4</em>(.*?)</li')
        notexistRe = re.compile(u'(p>)\u62b1\u6b49\uff0c\u60a8\u6307\u5b9a\u7684\u7528\u6237\u7a7a\u95f4\u4e0d\u5b58\u5728<')
        url1 = 'http://rs.xidian.edu.cn/home.php?mod=space&uid=%s'
        url2 = 'http://rs.xidian.edu.cn/home.php?mod=space&uid=%s&do=profile'
        file1 = 'ruisi\\correct_re.txt'
        file2 = 'ruisi\\errTime_re.txt'
        file3 = 'ruisi\\notexist_re.txt'
        file4 = 'ruisi\\unkownsex_re.txt'
        file5 = 'ruisi\\httperror_re.txt'
        # Walk the folder and pick up every file whose name starts with httperror
        for filename in os.listdir(r'E:\pythonProject\ruisi'):
            if filename.startswith('httperror'):
                count = 0
                newName = 'E:\\pythonProject\\ruisi\\%s' % (filename)
                readFile = open(newName, 'r')
                oldLine = '0'
                for line in readFile:
                    # newLine is compared with the previous line to detect a repeated id
                    newLine = line
                    if (newLine != oldLine):
                        nu = newLine.split()[0]
                        oldLine = newLine
                        count += 1
                        # searchWeb() is the crawl function defined elsewhere in the crawler (not shown here)
                        searchWeb((int(nu),))
                readFile.close()
                print "%s deal %s lines" % (filename, count)
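To keep things simple, this code does not sort the re-crawled httperror ids back into separate range files; the results go straight into the following five files: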
    file1 = 'ruisi\\correct_re.txt'
    file2 = 'ruisi\\errTime_re.txt'
    file3 = 'ruisi\\notexist_re.txt'
    file4 = 'ruisi\\unkownsex_re.txt'
    file5 = 'ruisi\\httperror_re.txt'
The output log shows how many httperror records were processed for each file:
    "D:\Program Files\Python27\python.exe" E:/pythonProject/webCrawler/reload.py
    httperror132001-133001.txt deal 21 lines
    httperror2001-3001.txt deal 4 lines
    httperror251001-252001.txt deal 5 lines
    httperror254001-255001.txt deal 1 lines
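3. Counting the unkownsex Data with a Single Thread
The code is simple: a single thread counts the unkownsex users, meaning users whose sex could not be retrieved because of permissions or who never filled it in. We also checked that users with no sex value have no activity time either.
The data format is as follows: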
    253042 unkownsex
    253087 unkownsex
    253102 unkownsex
    253118 unkownsex
    253125 unkownsex
    253136 unkownsex
    253161 unkownsex

The counting script:

    import os, time

    sumCount = 0
    startTime = time.clock()
    for filename in os.listdir(r'E:\pythonProject\ruisi'):
        if filename.startswith('unkownsex'):
            count = 0
            newName = 'E:\\pythonProject\\ruisi\\%s' % (filename)
            readFile = open(newName, 'r')
            # Every line is one user, so counting lines counts users
            for line in readFile:
                count += 1
                sumCount += 1
            readFile.close()
            print "%s deal %s lines" % (filename, count)
    print '%s unkowns sex' % (sumCount)
    endTime = time.clock()
    print "cost time " + str(endTime - startTime) + " s"
It runs very quickly. The output:
    unkownsex1-1001.txt deal 204 lines
    unkownsex100001-101001.txt deal 50 lines
    unkownsex10001-11001.txt deal 206 lines
    #... intermediate output omitted
    unkownsex99001-100001.txt deal 56 lines
    unkownsex_re.txt deal 1085 lines
    14223 unkowns sex
    cost time 0.0813142301261 s
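The same per-file line count can be sketched in a self-contained form. This is a Python 3 illustration under assumptions: the function name `count_lines_by_prefix` is hypothetical, and the demo runs against a temporary directory with made-up files instead of `E:\pythonProject\ruisi`.

```python
import os
import tempfile


def count_lines_by_prefix(directory, prefix):
    """Return {filename: line_count} for files whose name starts with prefix."""
    counts = {}
    for filename in sorted(os.listdir(directory)):
        if not filename.startswith(prefix):
            continue
        path = os.path.join(directory, filename)
        # The with-block closes the file even if reading fails part-way.
        with open(path, "r", encoding="utf-8") as handle:
            counts[filename] = sum(1 for _ in handle)
    return counts


# Demo data: one matching file with two records, one non-matching file.
with tempfile.TemporaryDirectory() as demo_dir:
    with open(os.path.join(demo_dir, "unkownsex1-1001.txt"), "w", encoding="utf-8") as f:
        f.write("253042 unkownsex\n253087 unkownsex\n")
    with open(os.path.join(demo_dir, "correct1-1001.txt"), "w", encoding="utf-8") as f:
        f.write("31024 x 2014-11-11 13:20\n")
    demo_counts = count_lines_by_prefix(demo_dir, "unkownsex")

print(demo_counts, sum(demo_counts.values()))
```

Returning a dictionary keeps the per-file figures and the grand total available from one pass.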
4. Counting the correct Data with a Single Thread
The data format is as follows:
    31024 男 2014-11-11 13:20
    31283 男 2013-3-25 19:41
    31340 保密 2015-2-2 15:17
    31427 保密 2014-8-10 09:17
    31475 保密 2013-7-2 08:59
    31554 保密 2014-10-17 17:02
    31621 男 2015-5-16 19:27
    31872 保密 2015-1-11 16:49
    31915 保密 2014-5-4 11:01
    31997 保密 2015-5-16 20:14
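The code is below. The idea is to read each file line by line and use line.split() to pull out the sex field. sumCount is the total number of users, while boycount, girlcount, and secretcount count the users marked 男 (male), 女 (female), and 保密 (undisclosed) respectively. As before, the values are written as unicode literals; here they are compared for equality, not matched with a regular expression.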
    import os, sys, time

    reload(sys)
    sys.setdefaultencoding('utf-8')
    startTime = time.clock()
    sumCount = 0
    boycount = 0
    girlcount = 0
    secretcount = 0
    for filename in os.listdir(r'E:\pythonProject\ruisi'):
        if filename.startswith('correct'):
            newName = 'E:\\pythonProject\\ruisi\\%s' % (filename)
            readFile = open(newName, 'r')
            for line in readFile:
                # Fields are: id, sex, date, time
                sexInfo = line.split()[1]
                sumCount += 1
                if sexInfo == u'\u7537':  # male
                    boycount += 1
                elif sexInfo == u'\u5973':  # female
                    girlcount += 1
                elif sexInfo == u'\u4fdd\u5bc6':  # undisclosed
                    secretcount += 1
            readFile.close()
            print "until %s, sum is %s boys; %s girls; %s secret;" % (filename, boycount, girlcount, secretcount)
    print "total is %s; %s boys; %s girls; %s secret;" % (sumCount, boycount, girlcount, secretcount)
    endTime = time.clock()
    print "cost time " + str(endTime - startTime) + " s"
Note that each line reports the running totals up to and including that file, not the counts for that file alone. The output:
    until correct1-1001.txt, sum is 110 boys; 7 girls; 414 secret;
    until correct100001-101001.txt, sum is 125 boys; 13 girls; 542 secret;
    #... omitted
    until correct99001-100001.txt, sum is 11070 boys; 3113 girls; 26636 secret;
    until correct_re.txt, sum is 13937 boys; 4007 girls; 28941 secret;
    total is 46885; 13937 boys; 4007 girls; 28941 secret;
    cost time 3.60047888495 s
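The tallying step can also be expressed with `collections.Counter`, which removes the if/elif chain. A minimal Python 3 sketch, assuming the four-field record format shown above; `tally_sex` and the sample lines are hypothetical names and data for illustration.

```python
from collections import Counter

MALE = "\u7537"          # 男
FEMALE = "\u5973"        # 女
SECRET = "\u4fdd\u5bc6"  # 保密


def tally_sex(lines):
    """Count how often each value appears in the second field of each line."""
    counts = Counter()
    for line in lines:
        fields = line.split()
        # Skip blank or malformed lines instead of raising IndexError.
        if len(fields) >= 2:
            counts[fields[1]] += 1
    return counts


sample = [
    "31024 \u7537 2014-11-11 13:20",
    "31340 \u4fdd\u5bc6 2015-2-2 15:17",
    "31427 \u4fdd\u5bc6 2014-8-10 09:17",
]
result = tally_sex(sample)
print(result[MALE], result[FEMALE], result[SECRET])  # 1 0 2
```

A Counter returns 0 for a key it has never seen, so a missing category needs no special handling.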
5. Counting with Multiple Threads
To count faster, we can try multiple threads.
The single-threaded timing above serves as the baseline for comparison.
    # encoding: UTF-8
    import threading
    import time, os, sys

    # Global counters
    SUM = 0
    BOY = 0
    GIRL = 0
    SECRET = 0
    NUM = 0

    # This class inherits from threading.Thread, overrides run(), and is launched with start();
    # this is much like Java
    class StaFileList(threading.Thread):
        # List of file names
        fileList = []

        def __init__(self, fileList):
            threading.Thread.__init__(self)
            self.fileList = fileList

        def run(self):
            global SUM, BOY, GIRL, SECRET
            # Adding a delay here makes the multithreading more visible,
            # instead of thread-1, 2, 3 finishing in order
            # time.sleep(1)
            # acquire() takes the lock
            if mutex.acquire(1):
                self.staFiles(self.fileList)
                # release() gives the lock back
                mutex.release()

        # Process the given list of files and count males and females
        # Mind the data synchronization issue here; global gives access to the module-level counters
        def staFiles(self, files):
            global SUM, BOY, GIRL, SECRET
            for name in files:
                newName = 'E:\\pythonProject\\ruisi\\%s' % (name)
                readFile = open(newName, 'r')
                for line in readFile:
                    sexInfo = line.split()[1]
                    SUM += 1
                    if sexInfo == u'\u7537':
                        BOY += 1
                    elif sexInfo == u'\u5973':
                        GIRL += 1
                    elif sexInfo == u'\u4fdd\u5bc6':
                        SECRET += 1
                # print "thread %s, until %s, total is %s; %s boys; %s girls;" \
                #     " %s secret;" % (self.name, name, SUM, BOY, GIRL, SECRET)

    def test():
        # files collects file names; its length sets how many files one thread handles
        files = []
        # Keep every thread so the main thread can wait for all child threads at the end
        staThreads = []
        i = 0
        for filename in os.listdir(r'E:\pythonProject\ruisi'):
            # Create one thread for every 20 files collected
            if filename.startswith('correct'):
                files.append(filename)
                i += 1
                # One thread handles 20 files
                if i == 20:
                    staThreads.append(StaFileList(files))
                    files = []
                    i = 0
        # Leftover files, most likely fewer than 20
        if files:
            staThreads.append(StaFileList(files))
        for t in staThreads:
            t.start()
        # The main thread waits for all child threads to exit; would it be faster without this?
        for t in staThreads:
            t.join()

    if __name__ == '__main__':
        reload(sys)
        sys.setdefaultencoding('utf-8')
        startTime = time.clock()
        mutex = threading.Lock()
        test()
        print "Multi Thread, total is %s; %s boys; %s girls; %s secret;" % (SUM, BOY, GIRL, SECRET)
        endTime = time.clock()
        print "cost time " + str(endTime - startTime) + " s"
Output:
    Multi Thread, total is 46885; 13937 boys; 4007 girls; 28941 secret;
    cost time 0.132137192794 s
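We find that the time is about the same as with a single thread. (The 3.6 s measured in section 4 is not directly comparable, because that run also printed a line for every file.) Multithreading gains little here because thread synchronization is involved: acquiring and releasing the lock costs time, and so does saving and restoring state on every thread switch. In this code each thread also holds the lock for its entire run, so the threads effectively execute one after another.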
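One way to shrink the time spent holding the lock is to let every thread count into a private counter and take the lock only to merge its result. The sketch below is a Python 3 illustration of that idea, not the approach used in this article; `count_batch`, `totals`, and the in-memory sample batches are assumptions. In CPython the global interpreter lock still limits how much pure-Python counting can run in parallel, so any gain would mostly come from overlapping file reads.

```python
import threading
from collections import Counter

totals = Counter()
totals_lock = threading.Lock()


def count_batch(lines):
    """Tally one batch privately, then merge into the shared totals."""
    local = Counter()
    for line in lines:
        fields = line.split()
        if len(fields) >= 2:
            local[fields[1]] += 1
    # The lock is held only for the short merge, not for the whole batch.
    with totals_lock:
        totals.update(local)


# Two in-memory batches stand in for two groups of files.
batches = [
    ["1 \u7537 2014-1-1 10:00", "2 \u4fdd\u5bc6 2014-1-2 11:00"],
    ["3 \u5973 2014-1-3 12:00", "4 \u4fdd\u5bc6 2014-1-4 13:00"],
]
threads = [threading.Thread(target=count_batch, args=(batch,)) for batch in batches]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(sum(totals.values()), dict(totals))
```

Using `with totals_lock:` also guarantees the lock is released if the merge raises, which the explicit acquire/release pair does not.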
6. Single Thread versus Multiple Threads on More Data
We can process the correct, errTime, and unkownsex files together.
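Selecting files by several prefixes and splitting them into groups of 20, as the listings in this section do, can be sketched compactly. This is a Python 3 illustration; `select_files` and `batches` are hypothetical helper names and the file names are made up. It relies on `str.startswith` accepting a tuple of prefixes.

```python
PREFIXES = ("correct", "errTime", "unkownsex")


def select_files(filenames, prefixes=PREFIXES):
    """Keep only the names that start with one of the given prefixes."""
    return [name for name in filenames if name.startswith(prefixes)]


def batches(items, size=20):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


names = [
    "correct1-1001.txt",
    "errTime1-1001.txt",
    "httperror1-1001.txt",
    "unkownsex1-1001.txt",
]
selected = select_files(names)
print(selected)
print([len(group) for group in batches(list(range(45)))])  # [20, 20, 5]
```

Each yielded group could then be handed to one worker thread, with the final group simply being shorter.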
Single-threaded code:
    # coding=utf-8
    import os, sys, time

    reload(sys)
    sys.setdefaultencoding('utf-8')
    startTime = time.clock()
    sumCount = 0
    boycount = 0
    girlcount = 0
    secretcount = 0
    unkowncount = 0
    for filename in os.listdir(r'E:\pythonProject\ruisi'):
        # Has both sex and activity time
        if filename.startswith('correct'):
            newName = 'E:\\pythonProject\\ruisi\\%s' % (filename)
            readFile = open(newName, 'r')
            for line in readFile:
                sexInfo = line.split()[1]
                sumCount += 1
                if sexInfo == u'\u7537':
                    boycount += 1
                elif sexInfo == u'\u5973':
                    girlcount += 1
                elif sexInfo == u'\u4fdd\u5bc6':
                    secretcount += 1
            # print "until %s, sum is %s boys; %s girls; %s secret;" % (filename, boycount, girlcount, secretcount)
        # No activity time, but has sex
        elif filename.startswith("errTime"):
            newName = 'E:\\pythonProject\\ruisi\\%s' % (filename)
            readFile = open(newName, 'r')
            for line in readFile:
                sexInfo = line.split()[1]
                sumCount += 1
                if sexInfo == u'\u7537':
                    boycount += 1
                elif sexInfo == u'\u5973':
                    girlcount += 1
                elif sexInfo == u'\u4fdd\u5bc6':
                    secretcount += 1
            # print "until %s, sum is %s boys; %s girls; %s secret;" % (filename, boycount, girlcount, secretcount)
        # No sex and no time: just count the lines
        elif filename.startswith("unkownsex"):
            newName = 'E:\\pythonProject\\ruisi\\%s' % (filename)
            # count = len(open(newName,'rU').readlines())
            # For large files use a loop instead; count starts at -1 so that
            # an empty file ends up as 0 lines after the +1 below
            count = -1
            for count, line in enumerate(open(newName, 'rU')):
                pass
            count += 1
            unkowncount += count
            sumCount += count
            # print "until %s, sum is %s unkownsex" % (filename, unkowncount)
    print "Single Thread, total is %s; %s boys; %s girls; %s secret; %s unkownsex;" % (sumCount, boycount, girlcount, secretcount, unkowncount)
    endTime = time.clock()
    print "cost time " + str(endTime - startTime) + " s"
Output:
Single Thread, total is 61111; 13937 boys; 4009 girls; 28942 secret; 14223 unkownsex;
cost time 1.37444645628 s
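The line-counting trick in the unkownsex branch (start at -1, enumerate, then add 1) can be written a little more directly. A minimal Python 3 sketch, assuming any iterable of lines; `count_lines` is a hypothetical name, and `io.StringIO` stands in for a real file.

```python
import io


def count_lines(handle):
    """Count lines without loading the whole file into memory."""
    # If the handle is empty the loop body never runs and count stays 0.
    count = 0
    # start=1 makes the loop variable equal the number of lines seen so far.
    for count, _line in enumerate(handle, start=1):
        pass
    return count


print(count_lines(io.StringIO("")))           # 0
print(count_lines(io.StringIO("a\nb\nc\n")))  # 3
```

Starting the enumeration at 1 removes the need for the -1 initial value and the trailing increment.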
Multithreaded code:
    __author__ = 'admin'
    # encoding: UTF-8
    # Multithreaded version
    import threading
    import time, os, sys

    # Global counters
    SUM = 0
    BOY = 0
    GIRL = 0
    SECRET = 0
    UNKOWN = 0

    class StaFileList(threading.Thread):
        # List of file names
        fileList = []

        def __init__(self, fileList):
            threading.Thread.__init__(self)
            self.fileList = fileList

        def run(self):
            global SUM, BOY, GIRL, SECRET
            if mutex.acquire(1):
                self.staManyFiles(self.fileList)
                mutex.release()

        # Process the given list of files and count males and females
        # Mind the data synchronization issue here
        def staCorrectFiles(self, files):
            global SUM, BOY, GIRL, SECRET
            for name in files:
                newName = 'E:\\pythonProject\\ruisi\\%s' % (name)
                readFile = open(newName, 'r')
                for line in readFile:
                    sexInfo = line.split()[1]
                    SUM += 1
                    if sexInfo == u'\u7537':
                        BOY += 1
                    elif sexInfo == u'\u5973':
                        GIRL += 1
                    elif sexInfo == u'\u4fdd\u5bc6':
                        SECRET += 1
                # print "thread %s, until %s, total is %s; %s boys; %s girls;" \
                #     " %s secret;" % (self.name, name, SUM, BOY, GIRL, SECRET)

        def staManyFiles(self, files):
            global SUM, BOY, GIRL, SECRET, UNKOWN
            for name in files:
                if name.startswith('correct'):
                    newName = 'E:\\pythonProject\\ruisi\\%s' % (name)
                    readFile = open(newName, 'r')
                    for line in readFile:
                        sexInfo = line.split()[1]
                        SUM += 1
                        if sexInfo == u'\u7537':
                            BOY += 1
                        elif sexInfo == u'\u5973':
                            GIRL += 1
                        elif sexInfo == u'\u4fdd\u5bc6':
                            SECRET += 1
                    # print "thread %s, until %s, total is %s; %s boys; %s girls;" \
                    #     " %s secret;" % (self.name, name, SUM, BOY, GIRL, SECRET)
                # No activity time, but has sex
                elif name.startswith("errTime"):
                    newName = 'E:\\pythonProject\\ruisi\\%s' % (name)
                    readFile = open(newName, 'r')
                    for line in readFile:
                        sexInfo = line.split()[1]
                        SUM += 1
                        if sexInfo == u'\u7537':
                            BOY += 1
                        elif sexInfo == u'\u5973':
                            GIRL += 1
                        elif sexInfo == u'\u4fdd\u5bc6':
                            SECRET += 1
                    # print "thread %s, until %s, total is %s; %s boys; %s girls;" \
                    #     " %s secret;" % (self.name, name, SUM, BOY, GIRL, SECRET)
                # No sex and no time: just count the lines
                elif name.startswith("unkownsex"):
                    newName = 'E:\\pythonProject\\ruisi\\%s' % (name)
                    # count = len(open(newName,'rU').readlines())
                    # For large files use a loop instead; count starts at -1 so that
                    # an empty file ends up as 0 lines after the +1 below
                    count = -1
                    for count, line in enumerate(open(newName, 'rU')):
                        pass
                    count += 1
                    UNKOWN += count
                    SUM += count
                    # print "thread %s, until %s, total is %s; %s unkownsex" % (self.name, name, SUM, UNKOWN)

    def test():
        files = []
        # Keep every thread so the main thread can wait for all child threads at the end
        staThreads = []
        i = 0
        for filename in os.listdir(r'E:\pythonProject\ruisi'):
            # Create one thread for every 20 files collected
            if filename.startswith("correct") or filename.startswith("errTime") or filename.startswith("unkownsex"):
                files.append(filename)
                i += 1
                if i == 20:
                    staThreads.append(StaFileList(files))
                    files = []
                    i = 0
        # Leftover files, most likely fewer than 20
        if files:
            staThreads.append(StaFileList(files))
        for t in staThreads:
            t.start()
        # The main thread waits for all child threads to exit
        for t in staThreads:
            t.join()

    if __name__ == '__main__':
        reload(sys)
        sys.setdefaultencoding('utf-8')
        startTime = time.clock()
        mutex = threading.Lock()
        test()
        print "Multi Thread, total is %s; %s boys; %s girls; %s secret; %s unkownsex" % (SUM, BOY, GIRL, SECRET, UNKOWN)
        endTime = time.clock()
        print "cost time " + str(endTime - startTime) + " s"
Output:
Multi Thread, total is 61111; 13937 boys; 4009 girls; 28942 secret;
cost time 1.23049112201 s
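Multithreading does come out slightly ahead of the single-threaded version here (1.23 s versus 1.37 s), and because access is synchronized, the totals are consistent between the two.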
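Since the gap between the two timings is small, repeating the measurement helps separate a real difference from noise. A minimal sketch written for Python 3, where `time.perf_counter()` takes the place of the `time.clock()` call used in the listings (`time.clock()` was removed in Python 3.8); `timed` and `best_of` are hypothetical helper names, and the summing workload is only a stand-in for the file processing.

```python
import time


def timed(func, *args):
    """Run func(*args) once and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start


def best_of(func, args=(), repeats=5):
    """Return the smallest elapsed time over several runs to damp noise."""
    return min(timed(func, *args)[1] for _ in range(repeats))


# Stand-in workload (an assumption): summing a range instead of reading files.
result, elapsed = timed(sum, range(100000))
fastest = best_of(sum, (range(100000),))
print(result, elapsed >= 0, fastest >= 0)
```

Taking the minimum of several runs is a common convention because outside interference can only make a run slower, never faster.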
Note that inside a Python class you frequently have to write self explicitly, which is quite different from Java.
    def __init__(self, fileList):
        threading.Thread.__init__(self)
        self.fileList = fileList

    def run(self):
        global SUM, BOY, GIRL, SECRET
        if mutex.acquire(1):
            # Calling another method of the same class requires self
            self.staFiles(self.fileList)
            mutex.release()
total is 61111; 13937 boys; 4009 girls; 28942 secret; 14223 unkownsex;
cost time 1.25413238673 s
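The point about self can be seen in a tiny standalone class. This is a Python 3 sketch with a hypothetical `LineCounter` class and made-up sample lines; it is unrelated to the thread class apart from showing the same calling convention.

```python
class LineCounter:
    """Counts non-blank lines; shows attribute and method access via self."""

    def __init__(self, lines):
        self.lines = lines
        self.total = 0

    def count_one(self, line):
        if line.strip():
            self.total += 1

    def run(self):
        for line in self.lines:
            # Writing count_one(line) without "self." would raise NameError,
            # because count_one is not a module-level function.
            self.count_one(line)
        return self.total


counter = LineCounter(["253042 unkownsex", "", "253087 unkownsex"])
print(counter.run())  # 2
```

In Java the equivalent call could omit `this`, since unqualified method names are resolved against the enclosing class.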
That is all for this article; I hope it helps with your learning.