Scanning a Website for Email Addresses with Python (Source Code Included)

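Information gathering is a key part of penetration testing, and the more an attacker knows about a target, the more options they have. For example, if we know which server software and version a site runs, we can test the server against the known vulnerabilities of that software. Likewise, if we know the email address of the server's administrator, we can mount a phishing attack. Scanning a website for email addresses is therefore one of the prerequisites for phishing.

Below we write a Python script that crawls a website and collects the email addresses it finds. The goal is to learn some Python while building it.

The complete code is at the end.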

Basic approach

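1. After the target site is passed to the tool, first check and analyse the input. The address may arrive in many forms, such as http://www.xxxx.com/ or http://www.xxxx.com/123/456/789.html, so we split it into its parts to make the later link crawling easier.
2. Fetch the content of the target address with the requests library and search that content for email addresses with a regular expression.
3. Find the hyperlinks in the fetched page. Following them takes us to other pages of the site, where we keep looking for email addresses.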
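As a rough illustration of the splitting in step 1, here is a minimal sketch that uses only the standard library. The helper name `split_target` is hypothetical and does not appear in the script below; it only shows which pieces of a URL the crawler needs.

```python
from urllib.parse import urlsplit

def split_target(url):
    """Return (base_url, directory_url) for a target URL (hypothetical helper)."""
    parts = urlsplit(url)
    base_url = f"{parts.scheme}://{parts.netloc}"
    # Directory part of the path: everything up to and including the last '/'.
    # An empty path (no trailing slash on the domain) is treated as '/'.
    directory = parts.path[: parts.path.rfind("/") + 1] or "/"
    return base_url, base_url + directory

for target in ("http://www.xxxx.com/", "http://www.xxxx.com/123/456/789.html"):
    print(split_target(target))
```

The first element is what a root-relative link such as `/contact` should be appended to, and the second is what a relative link such as `about.html` should be appended to.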
Let's get started:

Libraries the script needs

    from bs4 import BeautifulSoup  # extracts data from web pages; converts the input document to Unicode automatically
    import requests  # a simple, easy-to-use HTTP library for Python
    import requests.exceptions
    import urllib.parse
    from collections import deque  # double-ended queue: a good fit for frequent appends and pops at both ends; use a list if you need random access
    import re  # regular expressions

Getting the scan target

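    user_url = input('[+] Enter Target URL to Scan:')
    urls = deque([user_url])  # queue of URLs still to visit, seeded with the target address
    scraped_urls = set()  # URLs already visited; a set is unordered, holds no duplicates, and supports intersection, difference and union
    emails = set()  # email addresses found so far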
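To see why a deque and a set suit this job, here is a small standalone sketch of the same bookkeeping. The `enqueue` helper and the variable names `queue` and `seen` are assumptions made for illustration; the script itself does this inline.

```python
from collections import deque

queue = deque(["http://www.xxxx.com/"])  # pages waiting to be fetched
seen = set()                              # pages already fetched

def enqueue(link):
    """Queue a link unless it is already waiting or already fetched."""
    if link not in queue and link not in seen:
        queue.append(link)

current = queue.popleft()  # take the oldest page first (breadth-first order)
seen.add(current)

enqueue("http://www.xxxx.com/about.html")
enqueue("http://www.xxxx.com/")            # already fetched, ignored
enqueue("http://www.xxxx.com/about.html")  # already queued, ignored
print(list(queue))
```

Membership tests on a deque scan the whole queue, which is fine for a 100-page crawl; a larger crawler would probably keep a second set of queued URLs instead.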

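Crawling pages for email addresses (100 pages)

First we analyse the target address and split it into scheme, domain and path. Then we fetch the page with the get method of requests and use a regular expression to pick out anything that looks like an email address: r'[a-z0-9\.\-+_]+@[a-z0-9\.\-+_]+\.[a-z]+'. Every string that matches this pattern is recorded.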
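The pattern can be tried on its own before wiring it into the crawler. The sketch below runs it against a made-up HTML fragment; `EMAIL_RE` and `sample_html` are illustrative names, and the addresses are invented.

```python
import re

# Same pattern as in the script, compiled once; re.I makes it case-insensitive.
EMAIL_RE = re.compile(r"[a-z0-9.\-+_]+@[a-z0-9.\-+_]+\.[a-z]+", re.I)

sample_html = """
<p>Contact: <a href="mailto:Admin@Example.com">Admin@Example.com</a></p>
<p>Sales: sales.team@shop.example.org, not-an-email@, @nobody</p>
"""

# A set removes the duplicate that comes from the href and the link text.
found = set(EMAIL_RE.findall(sample_html))
print(sorted(found))
```

The pattern is deliberately loose, so it also matches strings that are not addresses, for example an image name such as `logo@2x.png`. Results are best treated as candidates to review.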

    count = 0
    try:
      while len(urls):  # keep going while the queue is not empty
        count += 1  # counts the pages processed so far
        if count == 101:  # stop once 100 pages have been processed
          break
        url = urls.popleft()  # popleft() removes the leftmost URL from urls and returns it
        scraped_urls.add(url)
        parts = urllib.parse.urlsplit(url)  # printing parts shows: SplitResult(scheme='http', netloc='www.baidu.com', path='', query='', fragment='')
        base_url = '{0.scheme}://{0.netloc}'.format(parts)  # scheme: protocol; netloc: domain
        path = url[:url.rfind('/')+1] if '/' in parts.path else url + '/'  # directory part of the URL; add the missing '/' when the URL has no path
        print('[%d] Processing %s' % (count, url))
        try:
          head = {'User-Agent': "Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; en) Presto/2.8.131 Version/11.11"}
          response = requests.get(url, headers=head)
        except (requests.exceptions.MissingSchema, requests.exceptions.ConnectionError):
          continue  # skip URLs that have no scheme or cannot be reached
        new_emails = set(re.findall(r'[a-z0-9\.\-+_]+@[a-z0-9\.\-+_]+\.[a-z]+', response.text, re.I))  # extract email addresses from the page; re.I ignores case
        emails.update(new_emails)  # store the addresses found in emails

Following anchors to the next page to continue the search

        soup = BeautifulSoup(response.text, features='lxml')  # the lxml parser package must be installed
        for anchor in soup.find_all('a'):  # find the anchors: in HTML an <a> tag is a hyperlink and its href attribute is the link address
          link = anchor.attrs['href'] if 'href' in anchor.attrs else ''  # if the tag has an href attribute, its value is the link we want
          if link.startswith('/'):  # a link starting with / is only a path, so prepend base_url (scheme + domain)
            link = base_url + link
          elif not link.startswith('http'):  # neither / nor http at the start: a relative link, so prepend the current path
            link = path + link
          if link not in urls and link not in scraped_urls:  # queue the link only if it has not been seen before
            urls.append(link)
    except KeyboardInterrupt:
      print('[+] Closing')
    for mail in emails:
      print(mail)
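The script builds absolute links by string concatenation. As an alternative, the standard library's `urllib.parse.urljoin` resolves `/path`, `page.html` and `../page.html` in one call. The sketch below is not part of the original script; `normalize_link` is a hypothetical helper, and dropping `mailto:` links and fragments is an assumption about what a crawler would want.

```python
from urllib.parse import urljoin, urlsplit

def normalize_link(page_url, href):
    """Resolve href against page_url; return None for non-HTTP links."""
    absolute = urljoin(page_url, href)
    if urlsplit(absolute).scheme not in ("http", "https"):
        return None  # e.g. mailto: or javascript: links
    # Drop the fragment so page.html#top and page.html count as one page.
    return absolute.split("#", 1)[0]

page = "http://www.xxxx.com/123/456/789.html"
for href in ("/contact", "about.html", "../up.html", "mailto:a@xxxx.com", "#top"):
    print(href, "->", normalize_link(page, href))
```

In the loop over anchors, the two `startswith` branches could then be replaced by a single call, skipping the link when the result is `None`.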

Complete code

    from bs4 import BeautifulSoup
    import requests
    import requests.exceptions
    import urllib.parse
    from collections import deque
    import re

    user_url = input('[+] Enter Target URL to Scan:')
    urls = deque([user_url])
    scraped_urls = set()
    emails = set()
    count = 0
    try:
      while len(urls):
        count += 1
        if count == 101:  # stop once 100 pages have been processed
          break
        url = urls.popleft()
        scraped_urls.add(url)
        parts = urllib.parse.urlsplit(url)
        base_url = '{0.scheme}://{0.netloc}'.format(parts)
        path = url[:url.rfind('/')+1] if '/' in parts.path else url + '/'
        print('[%d] Processing %s' % (count, url))
        try:
          head = {'User-Agent': "Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; en) Presto/2.8.131 Version/11.11"}
          response = requests.get(url, headers=head)
        except (requests.exceptions.MissingSchema, requests.exceptions.ConnectionError):
          continue
        new_emails = set(re.findall(r'[a-z0-9\.\-+_]+@[a-z0-9\.\-+_]+\.[a-z]+', response.text, re.I))
        emails.update(new_emails)
        soup = BeautifulSoup(response.text, features='lxml')
        for anchor in soup.find_all('a'):
          link = anchor.attrs['href'] if 'href' in anchor.attrs else ''
          if link.startswith('/'):
            link = base_url + link
          elif not link.startswith('http'):
            link = path + link
          if link not in urls and link not in scraped_urls:
            urls.append(link)
    except KeyboardInterrupt:
      print('[+] Closing')
    for mail in emails:
      print(mail)