Scraping Primer
Identifying the Technology Behind a Website
The builtwith module
- Install - pip install builtwith
- Usage - import builtwith
 print(builtwith.parse('http://example.webscraping.com'))
 ## Output
 {'web-servers': ['Nginx'], 'web-frameworks': ['Web2py', 'Twitter Bootstrap'], 'programming-languages': ['Python'], 'javascript-frameworks': ['jQuery', 'Modernizr', 'jQuery UI']}
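The result is a plain dictionary that maps a category to a list of technology names, so it can be queried like any other dict. The sketch below is only an illustration: `find_categories` is a hypothetical helper, and the sample data is copied from the output above rather than fetched live, so a real call to `builtwith.parse()` may return different categories or values.

```python
# Sample result copied from the output above (not fetched live).
result = {
    'web-servers': ['Nginx'],
    'web-frameworks': ['Web2py', 'Twitter Bootstrap'],
    'programming-languages': ['Python'],
    'javascript-frameworks': ['jQuery', 'Modernizr', 'jQuery UI'],
}


def find_categories(result, technology):
    """Return the categories that list `technology` (case-insensitive).

    Hypothetical helper for illustration; it is not part of builtwith.
    """
    wanted = technology.lower()
    return [
        category
        for category, names in result.items()
        if any(name.lower() == wanted for name in names)
    ]


print(find_categories(result, 'jquery'))  # ['javascript-frameworks']
```

A check like this can help decide how to scrape a site: a page that relies heavily on a JavaScript framework may load its content dynamically, which plain HTML downloads would miss.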
Finding the Website Owner
python-whois
- Install - pip install python-whois
- Usage - import whois
 print(whois.whois('appspot.com'))
 ## Output
 {
 "domain_name": [
 "APPSPOT.COM",
 "appspot.com"
 ],
 "registrar": "MarkMonitor, Inc.",
 "whois_server": "whois.markmonitor.com",
 "referral_url": null,
 "updated_date": [
 "2018-02-06 10:30:28",
 "2018-02-06 02:30:29-08:00"
 ],
 "creation_date": [
 "2005-03-10 02:27:55",
 "2005-03-09 18:27:55-08:00"
 ],
 "expiration_date": [
 "2019-03-10 01:27:55",
 "2019-03-09 00:00:00-08:00"
 ],
 "name_servers": [
 "NS1.GOOGLE.COM",
 "NS2.GOOGLE.COM",
 "NS3.GOOGLE.COM",
 "NS4.GOOGLE.COM",
 "ns1.google.com",
 "ns4.google.com",
 "ns2.google.com",
 "ns3.google.com"
 ],
 "status": [
 "clientDeleteProhibited https://icann.org/epp#clientDeleteProhibited",
 "clientTransferProhibited https://icann.org/epp#clientTransferProhibited",
 "clientUpdateProhibited https://icann.org/epp#clientUpdateProhibited",
 "serverDeleteProhibited https://icann.org/epp#serverDeleteProhibited",
 "serverTransferProhibited https://icann.org/epp#serverTransferProhibited",
 "serverUpdateProhibited https://icann.org/epp#serverUpdateProhibited",
 "clientUpdateProhibited (https://www.icann.org/epp#clientUpdateProhibited)",
 "clientTransferProhibited (https://www.icann.org/epp#clientTransferProhibited)",
 "clientDeleteProhibited (https://www.icann.org/epp#clientDeleteProhibited)",
 "serverUpdateProhibited (https://www.icann.org/epp#serverUpdateProhibited)",
 "serverTransferProhibited (https://www.icann.org/epp#serverTransferProhibited)",
 "serverDeleteProhibited (https://www.icann.org/epp#serverDeleteProhibited)"
 ],
 "emails": [
 "abusecomplaints@markmonitor.com",
 "whoisrelay@markmonitor.com"
 ],
 "dnssec": "unsigned",
 "name": null,
 "org": "Google LLC",
 "address": null,
 "city": null,
 "state": "CA",
 "zipcode": null,
 "country": "US"
 }
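As the output above shows, a field may hold a list, a single value, or null, so code that reads the record has to handle all three cases. The following is a minimal sketch under that assumption: `first_value` is a hypothetical helper, and `record` is a hand-copied subset of the printed output, not a live lookup. In a live result the date fields may be datetime objects rather than strings.

```python
# Subset of the WHOIS output above, copied by hand (not a live lookup).
record = {
    "registrar": "MarkMonitor, Inc.",
    "org": "Google LLC",
    "name": None,
    "expiration_date": [
        "2019-03-10 01:27:55",
        "2019-03-09 00:00:00-08:00",
    ],
}


def first_value(field):
    """Return the first item of a list-valued field, or the field itself.

    Hypothetical helper for illustration; it is not part of python-whois.
    """
    if isinstance(field, (list, tuple)):
        return field[0] if field else None
    return field


print(first_value(record["org"]))              # Google LLC
print(first_value(record["expiration_date"]))  # 2019-03-10 01:27:55
print(first_value(record["name"]))             # None
```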
Writing Your First Web Crawler
Downloading a Web Page
import urllib.request
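The original listing stops at the import, so what follows is only one possible sketch of a download function built on `urllib.request`. The function name `download`, the `user_agent` default, and the retry policy (retry only on 5xx responses) are assumptions made for illustration, not the author's code.

```python
import urllib.request
from urllib.error import HTTPError, URLError


def download(url, user_agent='primer-example', num_retries=2):
    """Return the decoded body at `url`, or None if the download fails.

    The name, defaults, and retry policy are illustrative assumptions.
    """
    request = urllib.request.Request(url, headers={'User-Agent': user_agent})
    try:
        with urllib.request.urlopen(request) as response:
            return response.read().decode('utf-8', errors='replace')
    except HTTPError as error:
        # Retry only on 5xx responses, which are often temporary.
        if num_retries > 0 and 500 <= error.code < 600:
            return download(url, user_agent, num_retries - 1)
        return None
    except URLError:
        # Covers unreachable hosts, missing local files, unknown schemes.
        return None
```

Calling `download('http://example.webscraping.com')` would return the page HTML as a string when the request succeeds. `HTTPError` is caught before `URLError` because it is a subclass of it.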
Sitemap Crawler
def crawl_sitemap(url):
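Only the signature of `crawl_sitemap` survives in the listing above, so the body below is a guess at a typical approach rather than the original implementation: download the sitemap, pull the URLs out of its `<loc>` elements with a regular expression, then download each listed page. The helpers `fetch_text` and `extract_sitemap_links`, and the extra `fetch` parameter, are hypothetical names added for this sketch.

```python
import re
import urllib.request


def fetch_text(url):
    """Hypothetical helper: download `url` and return its body as text."""
    with urllib.request.urlopen(url) as response:
        return response.read().decode('utf-8', errors='replace')


def extract_sitemap_links(sitemap_xml):
    """Return the URLs found inside <loc>...</loc> elements."""
    return re.findall(r'<loc>\s*(.*?)\s*</loc>', sitemap_xml, flags=re.DOTALL)


def crawl_sitemap(url, fetch=fetch_text):
    """Download the sitemap at `url`, then every page it lists.

    Returns a dict mapping each listed URL to its downloaded text.
    The `fetch` parameter is an assumption that allows swapping in a
    different downloader (or a fake one for testing).
    """
    sitemap = fetch(url)
    pages = {}
    for link in extract_sitemap_links(sitemap):
        pages[link] = fetch(link)
    return pages
```

Passing the downloader in as an argument keeps the link-extraction logic testable without network access. A regular expression is enough for simple sitemaps; an XML parser would be a more robust choice for sitemaps that use namespaces or CDATA sections.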