Funky's NoteBook

Scraping-Primer

2018/07/01

Scraping Primer

Identifying the technology behind a site

The builtwith module
  • Install

    pip install builtwith
  • Usage

    import builtwith
    print(builtwith.parse('http://example.webscraping.com'))

    ## Output
    {'web-servers': ['Nginx'], 'web-frameworks': ['Web2py', 'Twitter Bootstrap'], 'programming-languages': ['Python'], 'javascript-frameworks': ['jQuery', 'Modernizr', 'jQuery UI']}
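As the output above shows, the result maps each category to a list of technology names. The following is a minimal sketch of how such a result might be flattened into a quick report. The helper name `summarize_tech` is hypothetical and not part of builtwith, and the sample dictionary is copied from the output above so the snippet runs without network access.

```python
def summarize_tech(result):
    """Flatten a builtwith-style result into sorted 'category: name' lines.

    `summarize_tech` is a hypothetical helper, not part of the builtwith package.
    """
    lines = []
    for category in sorted(result):
        for name in result[category]:
            lines.append(f"{category}: {name}")
    return lines


# Sample copied from the output shown above.
sample = {
    'web-servers': ['Nginx'],
    'web-frameworks': ['Web2py', 'Twitter Bootstrap'],
    'programming-languages': ['Python'],
    'javascript-frameworks': ['jQuery', 'Modernizr', 'jQuery UI'],
}

for line in summarize_tech(sample):
    print(line)
```

In a real run, the dictionary returned by `builtwith.parse(...)` could presumably be passed in place of `sample`.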

Finding the owner of a site

python-whois
  • Install

    pip install python-whois
  • Usage

    import whois
    print(whois.whois('appspot.com'))

    ## Output
    {
    "domain_name": [
    "APPSPOT.COM",
    "appspot.com"
    ],
    "registrar": "MarkMonitor, Inc.",
    "whois_server": "whois.markmonitor.com",
    "referral_url": null,
    "updated_date": [
    "2018-02-06 10:30:28",
    "2018-02-06 02:30:29-08:00"
    ],
    "creation_date": [
    "2005-03-10 02:27:55",
    "2005-03-09 18:27:55-08:00"
    ],
    "expiration_date": [
    "2019-03-10 01:27:55",
    "2019-03-09 00:00:00-08:00"
    ],
    "name_servers": [
    "NS1.GOOGLE.COM",
    "NS2.GOOGLE.COM",
    "NS3.GOOGLE.COM",
    "NS4.GOOGLE.COM",
    "ns1.google.com",
    "ns4.google.com",
    "ns2.google.com",
    "ns3.google.com"
    ],
    "status": [
    "clientDeleteProhibited https://icann.org/epp#clientDeleteProhibited",
    "clientTransferProhibited https://icann.org/epp#clientTransferProhibited",
    "clientUpdateProhibited https://icann.org/epp#clientUpdateProhibited",
    "serverDeleteProhibited https://icann.org/epp#serverDeleteProhibited",
    "serverTransferProhibited https://icann.org/epp#serverTransferProhibited",
    "serverUpdateProhibited https://icann.org/epp#serverUpdateProhibited",
    "clientUpdateProhibited (https://www.icann.org/epp#clientUpdateProhibited)",
    "clientTransferProhibited (https://www.icann.org/epp#clientTransferProhibited)",
    "clientDeleteProhibited (https://www.icann.org/epp#clientDeleteProhibited)",
    "serverUpdateProhibited (https://www.icann.org/epp#serverUpdateProhibited)",
    "serverTransferProhibited (https://www.icann.org/epp#serverTransferProhibited)",
    "serverDeleteProhibited (https://www.icann.org/epp#serverDeleteProhibited)"
    ],
    "emails": [
    "abusecomplaints@markmonitor.com",
    "whoisrelay@markmonitor.com"
    ],
    "dnssec": "unsigned",
    "name": null,
    "org": "Google LLC",
    "address": null,
    "city": null,
    "state": "CA",
    "zipcode": null,
    "country": "US"
    }
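In the record above, some fields hold a single value, some hold a list, and some lists repeat the same entry in different letter case (see `name_servers`). The sketch below shows one way such fields might be normalized. It assumes the result can be read like a dictionary, as the printed output suggests. The helper names `as_list` and `unique_lower` are hypothetical, and the data is a trimmed copy of the output above, so no WHOIS lookup is needed.

```python
def as_list(value):
    """Return `value` as a list: None -> [], scalar -> [scalar], list unchanged."""
    if value is None:
        return []
    if isinstance(value, (list, tuple)):
        return list(value)
    return [value]


def unique_lower(values):
    """Lower-case the strings and drop duplicates, keeping first-seen order."""
    seen = []
    for v in as_list(values):
        v = v.lower()
        if v not in seen:
            seen.append(v)
    return seen


# A trimmed copy of the record shown above, as a plain dict.
record = {
    "domain_name": ["APPSPOT.COM", "appspot.com"],
    "registrar": "MarkMonitor, Inc.",
    "name_servers": [
        "NS1.GOOGLE.COM", "NS2.GOOGLE.COM", "NS3.GOOGLE.COM", "NS4.GOOGLE.COM",
        "ns1.google.com", "ns4.google.com", "ns2.google.com", "ns3.google.com",
    ],
    "name": None,
    "org": "Google LLC",
}

print(unique_lower(record["domain_name"]))
print(unique_lower(record["name_servers"]))
print(as_list(record["org"]), as_list(record["name"]))
```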

Writing your first web crawler

Downloading a web page
    import urllib.request
    import urllib.error

    def download(url, user_agent='wswp', num_retries=2):
        print("Downloading:" + url)
        # The header name is 'User-Agent' (hyphen, not underscore)
        headers = {'User-Agent': user_agent}
        request = urllib.request.Request(url, headers=headers)
        try:
            html = urllib.request.urlopen(request).read()
        except urllib.error.URLError as e:
            # e.reason is not always a str, so convert it before concatenating
            print("Error:" + str(e.reason))
            html = None
            if num_retries > 0:
                # Retry only on 5xx server errors
                if hasattr(e, 'code') and 500 <= e.code < 600:
                    return download(url, user_agent, num_retries - 1)
        return html

    ## Try a download and catch the exception
    download("http://httpstat.us/500")

    ## Output
    Downloading:http://httpstat.us/500
    Error:Internal Server Error
    Downloading:http://httpstat.us/500
    Error:Internal Server Error
    Downloading:http://httpstat.us/500
    Error:Internal Server Error
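The retry behaviour seen in the output above (one initial attempt plus two retries on a 5xx error) can also be sketched as a loop and exercised without touching the network. This is an illustrative variant, not the original function: `download_with_retries` and `make_flaky_fetch` are hypothetical names, and passing the fetch function in as a parameter is an assumption made only so that a fake can stand in for the real request.

```python
import urllib.error


def download_with_retries(url, fetch, num_retries=2):
    """Call fetch(url), retrying up to num_retries times on 5xx errors.

    `fetch` is any callable that returns the page body or raises
    urllib.error.URLError. Returns (body_or_None, attempts_made).
    """
    attempts = 0
    while True:
        attempts += 1
        try:
            return fetch(url), attempts
        except urllib.error.URLError as e:
            # Only HTTPError carries a status code; plain URLError does not.
            code = getattr(e, 'code', None)
            retryable = code is not None and 500 <= code < 600
            if not retryable or attempts > num_retries:
                return None, attempts


def make_flaky_fetch(failures):
    """Build a fake fetch that raises HTTP 500 `failures` times, then succeeds."""
    state = {'left': failures}

    def fetch(url):
        if state['left'] > 0:
            state['left'] -= 1
            raise urllib.error.HTTPError(url, 500, 'Internal Server Error', {}, None)
        return b'<html>ok</html>'

    return fetch


# Two failures fit within the retry budget; three do not.
print(download_with_retries('http://example.test/', make_flaky_fetch(2)))
print(download_with_retries('http://example.test/', make_flaky_fetch(3)))
```

A loop avoids recursion and makes the total number of attempts easy to report, while the decision rule (retry only when the status code is in the 500 range) stays the same as in the function above.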
Sitemap crawler
    import re

    def crawl_sitemap(url):
        sitemap = download(url)
        if sitemap is None:  # the download failed, so there is nothing to parse
            return
        # Extract the page links from the <loc> tags
        links = re.findall('<loc>(.*?)</loc>', sitemap.decode('utf-8'))
        for link in links:
            html = download(link)

    ## Test
    crawl_sitemap("http://example.webscraping.com/sitemap.xml")

    ## Output
    Downloading:http://example.webscraping.com/sitemap.xml
    Downloading:http://example.webscraping.com/places/default/view/Afghanistan-1
    Downloading:http://example.webscraping.com/places/default/view/Aland-Islands-2
    Downloading:http://example.webscraping.com/places/default/view/Albania-3
    Downloading:http://example.webscraping.com/places/default/view/Algeria-4
    Downloading:http://example.webscraping.com/places/default/view/American-Samoa-5
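The crawler above pulls the `<loc>` entries out with a regular expression. As an alternative sketch, the same links could be read with the standard library's XML parser, which also copes with namespaced tags. The function name `sitemap_links` is hypothetical, and the inline sitemap is a made-up two-entry sample (using URLs from the output above) so the snippet runs without network access.

```python
import xml.etree.ElementTree as ET


def sitemap_links(sitemap_bytes):
    """Return the text of every <loc> element, ignoring any XML namespace."""
    root = ET.fromstring(sitemap_bytes)
    links = []
    for element in root.iter():
        # Namespaced tags look like '{namespace}loc'; keep only the local name.
        local_name = element.tag.rsplit('}', 1)[-1]
        if local_name == 'loc' and element.text:
            links.append(element.text.strip())
    return links


# A small inline sitemap so the sketch runs without network access.
sample_sitemap = b"""<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://example.webscraping.com/places/default/view/Afghanistan-1</loc></url>
  <url><loc>http://example.webscraping.com/places/default/view/Aland-Islands-2</loc></url>
</urlset>"""

for link in sitemap_links(sample_sitemap):
    print(link)
```

The regular expression is shorter. A parser is the more cautious choice when the sitemap may contain attributes, whitespace, or escaped characters inside the tags.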