Sunday 10 March 2013

http-proxy-cache

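What is a proxy cache?
There are three main reasons for using a proxy server:
  1. Because you are behind a firewall (for security) and have to.
  2. Because using a cache significantly speeds up page browsing for everyone.
  3. Because you do not have enough "real" IP addresses for your machines.
If you are behind a firewall, you are probably already using a proxy. If not, you might consider installing one.
Netscape Navigator and some other newer browsers have caching built in. On a single-user system, such as a PC on a dial-up line, this should be adequate. You can tune the cache parameters to make the cache larger or to check cache entries more often. Incidentally, in Netscape, Reload does not usually fetch a completely fresh copy of a document; it sends a GET If-Modified-Since together with Pragma: no-cache. Holding Shift while pressing Reload forces all frames to be reloaded from the source by sending Pragma: no-cache. To inspect the disk cache in Netscape, type about:cache.
Even on a single system, if you use more than one browser or have more than one user, a proxy cache may be more effective, because cached documents can be shared among all agents.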

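LAN systems

The real benefit comes when many users on a LAN share a proxy. Any new page accessed by anyone is stored in the cache. The next person to access the same page gets the cached copy directly, at full LAN speed, without going back to the source. This may be 1000 times faster, or more.
Systems for Windows: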

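Configuring a browser for a proxy

Some browsers may accept only one value, for instance (on Unix) by using setenv http_proxy http://somewhere.org:80/. Certain domains can be excluded by using setenv no_proxy some.org,some.other.org. Other browsers, such as Netscape, have a more sophisticated scheme for supporting multiple proxies. Netscape also has a scheme for automated proxy handling using JavaScript. Mosaic-2.7 also allows a list of proxies.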

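Bypassing the cache

If an HTTP request carries the Pragma: no-cache header, the cache is directed to fetch a new copy. It may also save that new copy itself. Using Reload in Netscape, Mosaic and Lynx (possibly all browsers) sends a request with this header.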

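Cacheable and uncacheable documents

Ordinary HTML documents are usually cacheable. Caching proxies generally require a valid Last-Modified header, and may not cache objects larger than a certain size or subject to other restrictions. HTML documents generated by CGI scripts can be made cacheable (or not) by generating a valid Expires header, although some proxies will not cache URLs containing "cgi-bin" or a query string. Documents requiring authentication are generally not cached. Netscape has an option to cache locally the documents obtained from an SSL server. If you turn this option on, anyone who gains access to your machine (quite possibly by stealing it) can read your recent secure transactions.
Note that different cache servers may interpret the HTTP specification slightly differently, so a document cached by one may not be cached by another.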

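Cache control and CGI

I obtained the following results by testing Apache 1.1.1 and Squid 1.0.5 with the cache test script:
Expires      Last-Modified   Apache       Squid
Tonight      Last Night      Cached       Cached
+1 minute    Last Night      Expires      Expires
Tonight      none            Not Cached   Cached
none         none            Not Cached   Cached
0            Last Night      Not Cached   Not Cached
Last Night   Last Night      Not Cached   Not Cached
Tonight      Tonight         Cached       Cached
Tonight      0               Not Cached   Cached
The default configuration of the Squid cache is not to cache URLs containing "cgi-bin" or "?"; I removed this setting to obtain these results.
RFC 1945 (the HTTP 1.0 specification) says that if the Expires time is equal to or earlier than the value of the Date header, the recipient must not cache the document. A value of zero (0) or an invalid date format is treated as equivalent to "expires immediately".
Suggested use in CGI scripts: if the output of the CGI script is really a static document, with the same content for the same query string (e.g. /cgi-bin/search?query=food), generate an appropriate Last-Modified date in RFC 1123 format. If the output is valid for a particular length of time, generate an appropriate Expires header. If the output is immediately invalid, or depends on other data, generate an Expires header equal to the current time or an illegal value.
To make better use of overall bandwidth, make as many things cacheable as possible. For instance, if you have a webcam showing the view from an office window, you might generate an Expires header 10 minutes or more in the future.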

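Server-side includes (SSI)

Apache, NCSA, and some other servers allow server-side includes in HTML (.shtml) files. Since the content of the document is assembled from several included files, the server does not normally set a Last-Modified date or Content-Length. Accordingly, such documents are uncacheable (and will therefore load much more slowly for anyone using a cache). Apache supports an option called XBitHack which allows a Last-Modified date to be sent. If you use it, you must update the timestamp of the .shtml wrapper file (the Unix command is touch) whenever any included file changes; otherwise people using a cache will not see your new document unless they explicitly Reload.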

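Content-Negotiation

If content negotiation is being used to serve different languages or image types, there may be only one URL for different contents. Accordingly, the Apache server does not set Last-Modified in this case. This problem is addressed in the HTTP 1.1 draft by the Vary header.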

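Expiring documents

Proxy caches look at the Expires header and use it to set an expiry date in the cache. If the header is absent, a default lifetime is assumed. At this time I am not aware of any proxies that examine HTML content for META tags; however, CERN and new Apache servers may use a metadata file scheme to generate extra fields such as Expires in the document head. CGI scripts can also generate appropriate fields using a library such as LWP for Perl. An invalid Expires header, such as 0 or a negative number, also makes the document uncacheable (it expires immediately).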

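Who is using proxies

Although RFC 1945 recommends not modifying the User-Agent field, certain proxies do, so they can be counted. About 8% of visitors to this page use one of the following proxies:
  • CERN-HTTPD
  • Harvest Cache
  • Squid Cache
All users behind a firewall must use a proxy, although caching is strictly a separate issue. SOCKS is a non-caching proxy scheme.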

NLANR Cache Project

Check out the Distributed Cache project at NLANR!

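The future

HTTP 1.1 (in draft at ds.internic.net) has many more schemes for proxy servers. We may well end up with a net of interacting proxy caches that vastly speeds up the entire Internet. Anyone with access faster than a 14.4 kbps modem will benefit.
Another development is the pre-fetching proxy cache, for instance Wcol: WWW Collector. Here the proxy actively seeks out related images and pages ahead of time. Originally written only to speed up Mosaic, it can now co-operate with a hierarchical cache scheme to fetch related pages while the user reads the first one.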
-------------------------------------------------------
English original:

HTTP Proxy Cache

These are just a few notes from a day's tinkering with Proxy Cache in Apache 1.1.1 and Squid 1.0.5.

What is Proxy Cache?

There are three reasons for using a proxy server:
  1. Because you are behind a firewall (for security) and you have to.
  2. Because using a cache speeds up Web browsing significantly, for you and everyone else.
  3. Because you don't have enough 'real' IP addresses for your machines.
If you are behind a firewall you are probably using one already. If not, you might consider installing one.
Netscape Navigator and other newer browsers have caching built in. On a single-user system, such as a PC on a phone line, this may be adequate. You can tune the cache parameters to make the cache larger, or to check entries more often. Incidentally, in Netscape, Reload does not always get a fresh copy of a document; it sends GET If-Modified-Since with Pragma: no-cache. Shift-Reload (holding down Shift while clicking Reload) will force all frames to be reloaded from source by sending Pragma: no-cache. For information on your disk cache in Netscape type about:cache, or type about:memory-cache and about:image-cache for information about the RAM and image caches. For information about a document, see about:document.
Even on a single system, if you use more than one browser or have more than one user, a proxy cache may help since cached documents can be shared among all agents.

LAN systems

The real benefits accrue from using a proxy cache on a LAN with many users. Any new page accessed by anyone is stored in the cache. The next person to access that page gets the cached copy, at full LAN speed, rather than going to the source. This may be a thousand times faster, or more.
Systems for Windows:

(List from John R Buchan)

Configuring a browser for proxy

Some browsers may accept only one value, for instance (on Unix) by using setenv http_proxy http://somewhere.org:80/. Certain domains may be excluded, typically one's own domain, by using setenv no_proxy some.org,some.other.org. Other browsers, such as Netscape, have a more sophisticated scheme for supporting multiple proxies. Netscape has a scheme for automated proxy handling using JavaScript. Mosaic-2.7 also allows a list of proxies.
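The environment-variable scheme can be sketched in a few lines of Python. This is only an illustration, not how any particular browser implements it: the function name choose_proxy and the host www.example.com are hypothetical, and the suffix-matching rule for no_proxy is an assumption about typical behaviour.

```python
def choose_proxy(host, env):
    """Return the proxy URL to use for `host`, or None for a direct connection.

    `env` is a mapping holding the http_proxy and no_proxy settings
    (for example os.environ). Hypothetical helper, for illustration only.
    """
    proxy = env.get("http_proxy")
    if not proxy:
        return None
    # no_proxy is a comma-separated list of domains to exclude.
    excluded = [d.strip().lower() for d in env.get("no_proxy", "").split(",") if d.strip()]
    host = host.lower()
    for domain in excluded:
        # Assumption: a domain matches itself and any host beneath it.
        if host == domain or host.endswith("." + domain):
            return None
    return proxy


settings = {
    "http_proxy": "http://somewhere.org:80/",
    "no_proxy": "some.org,some.other.org",
}
print(choose_proxy("www.example.com", settings))  # goes through the proxy
print(choose_proxy("www.some.org", settings))     # excluded, direct connection
```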

Bypassing cache

If an HTTP request has the Pragma: no-cache header set, then the cache is directed to get a new copy. It may, however, save the new copy itself. Using Reload on Netscape, Mosaic and Lynx (possibly all browsers) sends a request with this header.
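As a rough sketch of the request headers involved, the Python function below builds the headers for a Reload and for a Shift-Reload as these notes describe them for Netscape. The function name reload_headers and its parameters are hypothetical; only the header names come from the text.

```python
from email.utils import formatdate


def reload_headers(cached_mtime=None, force=False):
    """Build request headers for a Netscape-style Reload.

    cached_mtime: POSIX timestamp of the cached copy's Last-Modified date,
                  or None if nothing is cached.
    force:        True for Shift-Reload (unconditional fetch from the source).
    """
    # Pragma: no-cache directs any proxy cache on the path to get a new copy.
    headers = {"Pragma": "no-cache"}
    if not force and cached_mtime is not None:
        # A plain Reload is still conditional on the document having changed.
        headers["If-Modified-Since"] = formatdate(cached_mtime, usegmt=True)
    return headers


print(reload_headers(cached_mtime=0))
print(reload_headers(cached_mtime=0, force=True))
```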

Cacheable and uncacheable documents

Regular HTML files are usually cacheable. Caching agents may require a valid Last-Modified header, and may not cache objects greater than a certain size or subject to other restrictions. HTML documents generated by CGI scripts can be made cacheable or not by generating an Expires header, though some agents may not cache URLs with "cgi-bin" or a query string. Documents requiring authorisation should not normally be cached. Netscape has an option to cache documents obtained from an SSL (Secure) server locally. If you turn this on, someone who gains access to your computer (perhaps by stealing it) can read all your recent secure transactions. Note that different cache servers may interpret the HTTP specification in slightly different ways, so that a document cached by one may not be cached by another.

Cache Control and CGI

I obtained the following results with Apache 1.1.1 and Squid 1.0.5, using the cache test script:
Expires      Last-Modified   Apache       Squid
Tonight      Last Night      Cached       Cached
+1 minute    Last Night      Expires      Expires
Tonight      none            Not Cached   Cached
none         none            Not Cached   Cached
0            Last Night      Not Cached   Not Cached
Last Night   Last Night      Not Cached   Not Cached
Tonight      Tonight         Cached       Cached
Tonight      0               Not Cached   Cached

The default configuration of the Squid cache is not to cache URLs with "cgi-bin" or "?"; this has been commented out to obtain these results. RFC1945 (the HTTP 1.0 spec) says that if the Expires date is equal to or earlier than the value of the Date header, the recipient must not cache the document. A value of zero (0) or an invalid date format should be considered equivalent to "expires immediately".
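The RFC1945 rule quoted above can be written as a small check. This is a minimal sketch, not code from Apache or Squid; the function name is hypothetical, and treating an unparsable Date header as "do not cache" is an assumption.

```python
from email.utils import formatdate, mktime_tz, parsedate_tz


def expires_allows_caching(expires, date):
    """Apply the RFC1945 Expires rule to two header values (strings)."""
    parsed_expires = parsedate_tz(expires)
    parsed_date = parsedate_tz(date)
    if parsed_expires is None or parsed_date is None:
        # "0" or any other unparsable value counts as "expires immediately".
        # (Rejecting an unparsable Date as well is an assumption.)
        return False
    # Cacheable only if Expires is strictly later than Date.
    return mktime_tz(parsed_expires) > mktime_tz(parsed_date)


now = 1_000_000_000  # arbitrary fixed timestamp for the example
date_header = formatdate(now, usegmt=True)
print(expires_allows_caching(formatdate(now + 600, usegmt=True), date_header))
print(expires_allows_caching(date_header, date_header))
print(expires_allows_caching("0", date_header))
```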
Suggested use in CGI scripts: If the output of the CGI script is really a static document, and is the same for the same query string (e.g. /cgi-bin/search?query=food), generate an appropriate Last-Modified date in RFC1123 format. If the output is considered valid for a particular length of time, generate the appropriate Expires header. If the output is immediately invalid, or depends on other data, generate an Expires header equal to the current time, or an illegal (0) value.
In order to make better use of the global bandwidth, it is probably a good idea to make as many things cacheable as possible. This means, for instance, that if you have a Webcam showing the view from an office window, essentially looking at the weather, you might generate an Expires header 10 minutes or more in the future.
See this script (log-tail.pl) for an example.
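Separately from the log-tail.pl script, here is a minimal Python sketch of how a CGI script might generate these headers in RFC1123 format. The function name cgi_cache_headers and its parameters are hypothetical, and the 600-second lifetime simply mirrors the webcam example.

```python
import time
from email.utils import formatdate


def cgi_cache_headers(last_modified=None, lifetime=0, now=None):
    """Return cache-related header lines for a CGI response.

    last_modified: POSIX timestamp of the underlying data, or None.
    lifetime:      seconds the output stays valid; 0 means "expires now".
    now:           current POSIX time (defaults to time.time()).
    """
    now = time.time() if now is None else now
    lines = []
    if last_modified is not None:
        lines.append("Last-Modified: " + formatdate(last_modified, usegmt=True))
    # An Expires value equal to the current time makes the output uncacheable.
    lines.append("Expires: " + formatdate(now + lifetime, usegmt=True))
    return lines


# Webcam image considered valid for ten minutes.
for line in cgi_cache_headers(lifetime=600):
    print(line)
```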

Server-Side includes

Apache, NCSA, and some other servers allow server-side includes in HTML (.shtml) files. Since the contents of the document are composed of several included files, the server does not normally set a Last-Modified date or Content-Length. Accordingly, such documents are uncacheable (and will therefore load much more slowly for someone around the world who is using a cache). Apache supports an option known as XBitHack which allows a Last-Modified date to be sent. If you use this, you must touch the .shtml wrapper file any time the included files are changed, else people using a cache will not see your new document unless they explicitly Reload.
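The "touch the wrapper" step is easy to forget, so it may be worth automating. The following is a hedged sketch: touch_if_includes_changed is a hypothetical helper, and the file names in the usage comment are placeholders.

```python
import os


def touch_if_includes_changed(wrapper, includes):
    """Update the wrapper's modification time if any included file is newer.

    Returns True if the wrapper was touched.
    """
    wrapper_mtime = os.path.getmtime(wrapper)
    if any(os.path.getmtime(path) > wrapper_mtime for path in includes):
        # Sets access and modification times to now, like Unix `touch`
        # on an existing file.
        os.utime(wrapper, None)
        return True
    return False


# Example call (placeholder file names):
# touch_if_includes_changed("index.shtml", ["header.html", "footer.html"])
```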

Content-Negotiation

If content negotiation is being used, to serve different languages or image types, then there is only one URL with possibly different contents. Accordingly, the later Apache servers do not set Last-Modified for content-negotiated documents. This problem is addressed in the draft HTTP 1.1 specification using the Vary header.

Expiring documents

Proxy caches look at the Expires header and use it to set an expiry date in the cache. If one does not exist, a default lifetime is assumed. At this time I am unaware of proxies examining HTML content for META tags; however CERN and new Apache servers may use a metadata file scheme to generate extra fields such as Expires in the document head. CGI scripts may generate appropriate fields explicitly using e.g. the LWP Perl library. An invalid Expires header, such as a value of "0", makes the document uncacheable.

Who is using Proxies

Although RFC1945 recommends not modifying the User-Agent field, certain proxies do and they can be counted. About 8% of hits here use one of the following proxies:
  • CERN-HTTPD
  • Harvest Cache
  • Squid Cache
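A count like this could be produced by scanning the User-Agent values in a server log for the proxy names listed above. The sketch below assumes the proxy name appears somewhere in the modified User-Agent string; the sample strings are made up for illustration and are not real log entries.

```python
from collections import Counter

PROXY_MARKERS = ("CERN-HTTPD", "Harvest Cache", "Squid Cache")


def count_proxies(user_agents):
    """Count hits whose User-Agent string mentions a known proxy."""
    counts = Counter()
    for agent in user_agents:
        lowered = agent.lower()
        for marker in PROXY_MARKERS:
            if marker.lower() in lowered:
                counts[marker] += 1
                break  # count each hit once
    return counts


# Hypothetical sample data, not taken from a real log.
sample = [
    "ExampleBrowser/1.0",
    "ExampleBrowser/1.0 via Squid Cache",
    "ExampleBrowser/1.0 via CERN-HTTPD",
    "OtherBrowser/2.0 via Squid Cache",
]
hits = count_proxies(sample)
print(hits, sum(hits.values()) / len(sample))
```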
All users behind firewalls must use a proxy, though caching is strictly a separate issue. SOCKS is a non-caching proxy scheme.

NLANR Cache Project

Check out the Distributed Cache project at NLANR!

The Future

HTTP 1.1 (in draft at ds.internic.net) has many more schemes for proxy servers. We will probably see a net of interacting proxy caches which will vastly speed up the entire Web. Anyone who has faster Net access than a 14.4 modem will benefit. Another development is the pre-fetching proxy cache, for instance Wcol: WWW Collector. Here, the proxy actively seeks out related images and pages ahead of time. Originally written to speed up Mosaic (which would not display anything until all images had been fetched), it can co-operate with a hierarchical cache scheme to get related pages while the user reads the first one.

Testing Proxy Cache

It's not obvious how the cache is working, unless you have access to the proxy server (Apache 1.1.1 with -DEXPLAIN, for instance). You can experiment with the Netscape cache using the Cache Tester here.
http://vancouver-webpages.com/CacheNow/

from  http://vancouver-webpages.com/proxy.html
-----------------------------------------------------

Proxy Cache - Van-Pool for the Web

Do you wonder why your T-1, cable or ISDN connection is sometimes as slow as 28.8? Have you ever wondered what happens when you click in Netscape? Read on. In the summer of '96, thousands of you probably followed some of the Olympics on the Web. Thousands of identical copies of the same pages traversed a dozen computers from Atlanta to Vancouver. Does something about this strike you as crazy? There is an alternative; it's been part of the Web protocol for a long time but is only now becoming widely used. This is the concept of proxy cache.
Most modern browsers incorporate a local cache; in Netscape if you open the document "about:cache" you can see the current state of yours. This uses some of your hard disk to store pages and images you've seen. If you reload one of these pages, Netscape will issue a GET If-Modified-Since request, reloading the whole file from the server only if it has changed. This is a great improvement over the original browsers, but still short of the ideal.
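From the server's side, the If-Modified-Since exchange comes down to one date comparison. Here is a minimal sketch, assuming the server knows the file's modification time as a POSIX timestamp; conditional_get_status is a hypothetical name, and ignoring an unparsable header is an assumption.

```python
from email.utils import formatdate, mktime_tz, parsedate_tz


def conditional_get_status(last_modified, if_modified_since=None):
    """Return 304 if the file is unchanged since the given date, else 200.

    last_modified:     POSIX timestamp of the file on the server.
    if_modified_since: value of the If-Modified-Since header, or None.
    """
    if if_modified_since is not None:
        parsed = parsedate_tz(if_modified_since)
        # HTTP dates have one-second resolution, so compare whole seconds.
        if parsed is not None and int(last_modified) <= mktime_tz(parsed):
            return 304  # Not Modified: the browser reuses its cached copy
    return 200  # send the whole file


cached_date = formatdate(1000, usegmt=True)
print(conditional_get_status(1000, cached_date))  # unchanged
print(conditional_get_status(2000, cached_date))  # changed since then
```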

Proxy Cache is essentially a simple concept. Suppose that each member of your family has their own computer, you have Ethernet around your house and an ISDN line to an ISP. If each person selects the CNN homepage, several identical copies will be transferred over the ISDN at 128kbps. With a proxy cache, the first person will get the page at 128kbps. The rest will get it at 10Mbps over Ethernet. The same argument applies to an office with a 100Mbps LAN and a T-1 connection to the net.
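To put rough numbers on the household example, the following Python lines compare transfer times at the two speeds. The 100,000-byte page size is an assumption made for the arithmetic, and latency and protocol overhead are ignored.

```python
def transfer_seconds(size_bytes, bits_per_second):
    """Idealised transfer time: payload bits divided by link speed."""
    return size_bytes * 8 / bits_per_second


PAGE_BYTES = 100_000       # assumed page size, not from the article
ISDN_BPS = 128_000         # 128kbps ISDN line
ETHERNET_BPS = 10_000_000  # 10Mbps Ethernet

isdn = transfer_seconds(PAGE_BYTES, ISDN_BPS)          # first request, via the ISP
ethernet = transfer_seconds(PAGE_BYTES, ETHERNET_BPS)  # later requests, from the cache
speedup = ETHERNET_BPS / ISDN_BPS

print(f"ISDN: {isdn:.2f} s, Ethernet: {ethernet:.2f} s, ratio: {speedup:.1f}x")
```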

Hierarchical Cache Servers


The use of cache outlined above provides significant benefit where there is a reduction of bandwidth - going from a LAN to WAN, for instance. Cache servers can be used in a more sophisticated way. They can be configured not to request a document directly from the origin server, but from a parent or neighbour cache. Again, cache placed where there is a reduction in bandwidth, such as at national or ocean boundaries, can provide great benefits. NLANR in the US has set up such a scheme, as have various organizations in Europe and elsewhere. A browser requesting a page from Japan might look first in its local cache, then in a LAN cache, then in several neighbour caches, then in a national cache, before getting the page from the overseas server. Properly used, these schemes can turn the Infohighway from a two-lane road to a six-lane expressway.
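The lookup order described above can be modelled with a list of caches searched nearest first. This is a deliberately simplified sketch using in-memory dictionaries; real cache servers use their own protocols to query parents and neighbours. The names fetch, tiers and origin, and the URL, are hypothetical.

```python
def fetch(url, caches, origin):
    """Look up `url` in each cache, nearest first, then fall back to the origin.

    caches: list of (name, dict) pairs ordered from nearest to farthest.
    origin: callable that fetches the document from the origin server.
    Returns (body, name_of_source).
    """
    for index, (name, store) in enumerate(caches):
        if url in store:
            body, source = store[url], name
            break
    else:
        index = len(caches)
        body, source = origin(url), "origin"
    # Every cache nearer than the one that answered keeps a copy.
    for _, store in caches[:index]:
        store[url] = body
    return body, source


tiers = [("browser", {}), ("lan", {}), ("national", {"http://example.jp/": "page"})]
print(fetch("http://example.jp/", tiers, lambda u: "fresh copy"))  # found in the national cache
print(fetch("http://example.jp/", tiers, lambda u: "fresh copy"))  # now in the browser cache
```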

Controlling Cache

Using cache requires a certain amount of thought, or else pages can be refreshed either too slowly or too rapidly. The HTTP Expires header on a document explicitly gives its expiry date, causing caches to delete it at a specific time. A new request will then reload the updated document. If the Expires header is zero or "now", the document cannot be cached. This feature is supported in some servers such as Apache 1.1.1 and the CERN httpd, and can be used in CGI scripts. If the Expires header is not present, the cache guesses the expiry date based on the document's age. In any case, Reload in Netscape and other browsers will cause the document modification date to be checked. Shift-Reload in Netscape will unconditionally get a new document from the origin server. An online version of this article with links is available at vancouver-webpages.com/proxy/.
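One plausible way for a cache to "guess the expiry date based on the document's age" is to keep the document for a fraction of the time since it was last modified. The sketch below illustrates that idea only; the 10% fraction and the one-hour default are assumptions, not values taken from Apache, Squid or the CERN httpd.

```python
def expiry_time(fetched_at, expires=None, last_modified=None,
                age_fraction=0.1, default_lifetime=3600):
    """Pick an expiry timestamp for a cached document.

    All times are POSIX timestamps. age_fraction and default_lifetime
    are illustrative assumptions, not settings of any real cache.
    """
    if expires is not None:
        return expires  # an explicit Expires header always wins
    if last_modified is not None and last_modified < fetched_at:
        # Older documents are assumed to change less often.
        return fetched_at + age_fraction * (fetched_at - last_modified)
    return fetched_at + default_lifetime


DAY = 86400
# A document last changed 30 days before it was fetched is kept for about 3 days.
print((expiry_time(30 * DAY, last_modified=0) - 30 * DAY) / DAY)
```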

Further Reading


 from http://vancouver-webpages.com/proxy/