Scraping Primer
Identifying a Website's Technology
The builtwith module
Installation
pip install builtwith
Usage
import builtwith
print(builtwith.parse('http://example.webscraping.com'))
## Output
{'web-servers': ['Nginx'], 'web-frameworks': ['Web2py', 'Twitter Bootstrap'], 'programming-languages': ['Python'], 'javascript-frameworks': ['jQuery', 'Modernizr', 'jQuery UI']}
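The result is a plain dictionary that maps each category to a list of technology names, so it can be queried with ordinary Python. The sketch below reuses the output shown above as sample data; the helper names `list_technologies` and `uses_technology` are hypothetical and are not part of builtwith.

```python
def list_technologies(report):
    """Flatten a builtwith-style report into a sorted list of unique names."""
    names = set()
    for technologies in report.values():
        names.update(technologies)
    return sorted(names)


def uses_technology(report, name):
    """Return True if `name` appears in any category, ignoring case."""
    wanted = name.lower()
    return any(
        wanted == technology.lower()
        for technologies in report.values()
        for technology in technologies
    )


# Sample data copied from the output above, so no network access is needed.
report = {
    'web-servers': ['Nginx'],
    'web-frameworks': ['Web2py', 'Twitter Bootstrap'],
    'programming-languages': ['Python'],
    'javascript-frameworks': ['jQuery', 'Modernizr', 'jQuery UI'],
}

print(list_technologies(report))
print(uses_technology(report, 'jquery'))
```

A check like this can help decide how to scrape a site: for example, a site that relies heavily on a JavaScript framework may load its content dynamically.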
Finding the Website Owner
python-whois
Installation
pip install python-whois
Usage
import whois
print(whois.whois('appspot.com'))
## Output
{
  "domain_name": [
    "APPSPOT.COM",
    "appspot.com"
  ],
  "registrar": "MarkMonitor, Inc.",
  "whois_server": "whois.markmonitor.com",
  "referral_url": null,
  "updated_date": [
    "2018-02-06 10:30:28",
    "2018-02-06 02:30:29-08:00"
  ],
  "creation_date": [
    "2005-03-10 02:27:55",
    "2005-03-09 18:27:55-08:00"
  ],
  "expiration_date": [
    "2019-03-10 01:27:55",
    "2019-03-09 00:00:00-08:00"
  ],
  "name_servers": [
    "NS1.GOOGLE.COM",
    "NS2.GOOGLE.COM",
    "NS3.GOOGLE.COM",
    "NS4.GOOGLE.COM",
    "ns1.google.com",
    "ns4.google.com",
    "ns2.google.com",
    "ns3.google.com"
  ],
  "status": [
    "clientDeleteProhibited https://icann.org/epp#clientDeleteProhibited",
    "clientTransferProhibited https://icann.org/epp#clientTransferProhibited",
    "clientUpdateProhibited https://icann.org/epp#clientUpdateProhibited",
    "serverDeleteProhibited https://icann.org/epp#serverDeleteProhibited",
    "serverTransferProhibited https://icann.org/epp#serverTransferProhibited",
    "serverUpdateProhibited https://icann.org/epp#serverUpdateProhibited",
    "clientUpdateProhibited (https://www.icann.org/epp#clientUpdateProhibited)",
    "clientTransferProhibited (https://www.icann.org/epp#clientTransferProhibited)",
    "clientDeleteProhibited (https://www.icann.org/epp#clientDeleteProhibited)",
    "serverUpdateProhibited (https://www.icann.org/epp#serverUpdateProhibited)",
    "serverTransferProhibited (https://www.icann.org/epp#serverTransferProhibited)",
    "serverDeleteProhibited (https://www.icann.org/epp#serverDeleteProhibited)"
  ],
  "emails": [
    "abusecomplaints@markmonitor.com",
    "whoisrelay@markmonitor.com"
  ],
  "dnssec": "unsigned",
  "name": null,
  "org": "Google LLC",
  "address": null,
  "city": null,
  "state": "CA",
  "zipcode": null,
  "country": "US"
}
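In the output above, some fields are a single string, some are lists, some are null, and the name servers appear twice in different letter case. The sketch below shows one way to tidy such a record before using it. The sample values are copied from the output above; the helpers `as_list` and `unique_ignoring_case` are hypothetical and are not part of python-whois.

```python
def as_list(value):
    """Wrap a scalar in a list and map None to an empty list."""
    if value is None:
        return []
    if isinstance(value, (list, tuple)):
        return list(value)
    return [value]


def unique_ignoring_case(values):
    """Lowercase the values and drop duplicates, keeping first-seen order."""
    seen = set()
    result = []
    for value in values:
        key = value.lower()
        if key not in seen:
            seen.add(key)
            result.append(key)
    return result


# A subset of the record shown above, used as sample data.
record = {
    'domain_name': ['APPSPOT.COM', 'appspot.com'],
    'registrar': 'MarkMonitor, Inc.',
    'name': None,
    'name_servers': [
        'NS1.GOOGLE.COM', 'NS2.GOOGLE.COM', 'NS3.GOOGLE.COM', 'NS4.GOOGLE.COM',
        'ns1.google.com', 'ns4.google.com', 'ns2.google.com', 'ns3.google.com',
    ],
}

print(unique_ignoring_case(as_list(record['domain_name'])))
print(unique_ignoring_case(as_list(record['name_servers'])))
print(as_list(record['registrar']))
```

Normalizing every field to a list first means the rest of the code does not have to special-case each field's shape.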
Writing Your First Web Crawler
Downloading a Web Page
import urllib.request
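Only the import line is given here, so the following is a minimal sketch of a download helper built on `urllib.request`, not the original code. The function name `download`, the `user_agent` and `num_retries` parameters, and the choice to retry only on 5xx responses are all assumptions made for illustration.

```python
import urllib.error
import urllib.request


def download(url, user_agent='primer-example', num_retries=2):
    """Return the body of `url` as bytes, or None if the download fails."""
    # The User-Agent value is an arbitrary placeholder.
    request = urllib.request.Request(url, headers={'User-Agent': user_agent})
    try:
        with urllib.request.urlopen(request) as response:
            return response.read()
    except urllib.error.HTTPError as error:
        # HTTPError is a subclass of URLError, so it must be caught first.
        # Retry only on 5xx responses, which may be temporary server problems.
        if num_retries > 0 and 500 <= error.code < 600:
            return download(url, user_agent, num_retries - 1)
        print('Download error:', error.code, error.reason)
        return None
    except urllib.error.URLError as error:
        # Covers failures such as an unreachable host or a missing file.
        print('Download error:', error.reason)
        return None
```

Calling `download('http://example.webscraping.com')` would return the page as bytes when the request succeeds; call `.decode('utf-8')` on the result if text is needed and the page is known to be UTF-8.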
Sitemap Crawler
def crawl_sitemap(url):
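Only the function signature is given here. A sitemap crawler typically downloads the site's sitemap, extracts the URLs listed in its `<loc>` tags, and downloads each one in turn. The sketch below follows that outline under stated assumptions: the helper `extract_sitemap_links`, the default fetcher `_fetch_text`, the extra `fetch_text` parameter, and the regular expression are all hypothetical, and the sitemap is assumed to be UTF-8 XML that is not compressed.

```python
import re
import urllib.request


def extract_sitemap_links(sitemap_xml):
    """Return the URLs found inside <loc>...</loc> tags of a sitemap."""
    return re.findall(r'<loc>\s*(.*?)\s*</loc>', sitemap_xml)


def _fetch_text(url):
    """Download `url` and decode it as UTF-8 (an assumption about the site)."""
    with urllib.request.urlopen(url) as response:
        return response.read().decode('utf-8')


def crawl_sitemap(url, fetch_text=_fetch_text):
    """Download the sitemap at `url`, then every page it lists.

    Returns a dict mapping each listed URL to its downloaded text.
    `fetch_text` can be swapped out, for example to add retries or for testing.
    """
    sitemap = fetch_text(url)
    pages = {}
    for link in extract_sitemap_links(sitemap):
        pages[link] = fetch_text(link)
    return pages
```

Passing the fetcher as a parameter keeps the link extraction and crawling logic testable without network access. A regular expression is enough for a simple sitemap, but an XML parser would be more robust for sitemap index files or namespaced documents.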