Tuesday 4 December 2018

Empirical Analysis of Internet Filtering in China

This appendix offers technical details beyond those of the main report. It contains the following sections:
     Blocking of Entire Web Sites and Entire Servers
     Reporting Criteria and the "Blocking Quotient" of Reported Sites
     DNS Filtering/Redirection and Its Implications
     Independent Filtering Implementations and Corresponding Circumvention Techniques
     Other Effects of Chinese Filtering: Routing and Email

Blocking of Entire Web Sites and Entire Servers
We conducted testing of only one URL per Web host based on our background knowledge, reinforced by subsequent testing, that when the default page of a site was filtered, the entirety of that site was typically filtered.
To test the hypothesis of entire-site blocking, we formed a sample of web hosts found to be inaccessible, and we checked whether an arbitrary subdirectory on each such site was also inaccessible. Although the directory name we chose was intended not to exist on the servers, a web server ordinarily returns a "not found" error page in response to a request for a nonexistent path, so a reachable host would still have produced a response. We confirmed that these error pages were themselves inaccessible in 99.8% of tests. We attribute the other 0.2% of results to anomalies such as transient network errors that may have wrongly rendered the web host inaccessible in the first instance when the host was not intended to be blocked.
At the moment, then, it seems that when the default page ("front page") of a host is blocked, all other pages on that host are also blocked. (Of course, the reverse need not be the case, and the authors have separately confirmed multiple instances in which it is not the case.)
When an entire host is filtered, our data show that this filtering typically operates on the basis of the host's IP addresses rather than on the basis of its one or several domain names. To make this confirmation, we observed that when many web sites are hosted on a single web server (as is typical in commercial "shared hosting" at the lowest monthly rates), blocking by China of one web site on a given server (with a given IP address) typically entails blocking of all other web sites on that server. For example, we found a total of 308 distinct (by domain name and differing page content) blocked sites all hosted on the server at IP address 216.34.94.186, a parking/redirection server used by domain name registrar Dotster. To the extent that this server in fact hosts additional sites beyond those we tested, it is highly likely that they too were blocked. Indeed, a representative of domain name registrar enom reported to the authors that its primary domain name forwarding service had been blocked by China -- rendering unreachable literally hundreds of thousands of domain names that rely on that server.
While filtering of a host's top-level page predicts the filtering of all other pages on that site, such filtering is not technically mandated. Indeed, midway through our testing, the authors learned of and confirmed the blocking of certain pages on otherwise-accessible sites. At least some of this blocking appears to be triggered by one of relatively few keywords in page URLs or contents; this therefore represents a technical layer of blocking wholly distinct from (and seemingly rarer than) that which results in an entire site being made unavailable.
Since blocking typically affects an entire web server, our reporting includes all Yahoo and Google/DMOZ categories that reference any pages on affected web servers.
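
The shared-hosting observation above can be checked mechanically by grouping test results by server IP address. The following is a minimal sketch, not the authors' tooling: the function names, the (domain, ip, blocked) record layout, and the sample data (reserved documentation addresses and example domains) are all assumptions made for illustration.

```python
from collections import defaultdict


def group_by_ip(results):
    """Summarise test results per server IP address.

    `results` is an iterable of (domain, ip, blocked) tuples (a hypothetical
    record layout). Returns {ip: (blocked_count, total_count)}.
    """
    counts = defaultdict(lambda: [0, 0])
    for _domain, ip, blocked in results:
        counts[ip][1] += 1
        if blocked:
            counts[ip][0] += 1
    return {ip: (blocked, total) for ip, (blocked, total) in counts.items()}


def fully_blocked_ips(results, min_sites=2):
    """Return IPs hosting at least `min_sites` tested sites, all of them blocked.

    Such IPs are consistent with filtering by IP address rather than by
    domain name; they do not prove it.
    """
    summary = group_by_ip(results)
    return sorted(
        ip for ip, (blocked, total) in summary.items()
        if total >= min_sites and blocked == total
    )


if __name__ == "__main__":
    sample = [
        ("a.example", "192.0.2.10", True),
        ("b.example", "192.0.2.10", True),
        ("c.example", "192.0.2.20", True),
        ("d.example", "192.0.2.20", False),
    ]
    print(fully_blocked_ips(sample))
```

An IP address on which every tested site is blocked supports the IP-based inference; an address with mixed results, like the second one in the sample, would point instead to a domain-level or page-level mechanism.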

Reporting Criteria and the "Blocking Quotient" of Reported Sites
In order to distinguish intentional blocks from unintentional network outages or other variation, we tested candidate URLs multiple times and through multiple proxies. In many cases, sites were unavailable only on one occasion, or unavailable from one proxy in China while available from another. While such phenomena might represent intentional blocking that is simply limited in time or regional scope, we treat a URL as blocked "in China" only when it has been found to be unavailable on at least two occasions, and from at least two distinct proxies, all while still accessible from the United States. Variations in blocking across proxies, if not due to transient network failures, could reflect a distribution of authority to make and implement blocking decisions from one region to the next, or a technical burden or delay in programming key routers across China to block an undesirable URL.
To the extent that blocking varies across networks and across geographic locations, to describe a URL or entire Web site as "blocked in China" may be inexact -- a site can be found accessible in some places and simultaneously inaccessible in others. In the absence of further data about political decision-making and technical implementation, we can be only as precise as the data is accurate -- and we therefore apply a threshold of overall inaccessibility to determine that a site is "blocked in China."
We have received reports indicating that certain locations -- for example, hotels predominantly frequented by western visitors -- have significantly less stringent filtering policies. Our reporting of sites "blocked in China" should not be taken to describe Internet access from these locations.
Having tested all sites on multiple occasions from multiple distinct locations within China, the authors have found some sites that were blocked consistently -- on all occasions, from all locations -- while other sites were blocked less often. The "blocking quotient" slider in our reporting seeks to characterize this observation: a wide red bar signifies a site blocked more frequently, while a narrower bar denotes intermittent blocking or blocking observed from relatively fewer locations within China. We report this measurement with a slider rather than a number to reflect the uncertainty necessarily associated with these measurements and the resulting analysis.
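
The reporting threshold described in this section (unavailable on at least two occasions and from at least two distinct proxies, while reachable from the United States) can be expressed as a short rule. The sketch below is illustrative only: the `Observation` record and its field names are hypothetical, and the numeric `blocking_quotient` is an assumed stand-in (the share of valid tests that failed), since the report presents the quotient only as a slider and gives no formula.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Observation:
    """One test of one URL (hypothetical record layout)."""
    proxy: str                 # identifier of the proxy in China
    occasion: str              # identifier of the test run, e.g. a date
    reachable_in_china: bool   # result through the proxy
    reachable_from_us: bool    # control result from the United States


def is_blocked(observations):
    """Apply the stated threshold: >= 2 occasions and >= 2 distinct proxies."""
    failures = [
        o for o in observations
        if o.reachable_from_us and not o.reachable_in_china
    ]
    occasions = {o.occasion for o in failures}
    proxies = {o.proxy for o in failures}
    return len(occasions) >= 2 and len(proxies) >= 2


def blocking_quotient(observations):
    """Assumed metric: fraction of control-valid tests that failed in China."""
    valid = [o for o in observations if o.reachable_from_us]
    if not valid:
        return None  # no usable tests; the quotient is undefined
    failed = sum(1 for o in valid if not o.reachable_in_china)
    return failed / len(valid)
```

Tests in which the control request from the United States also failed are discarded in both functions, since such a failure says nothing about filtering in China.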

DNS Filtering/Redirection and Its Implications
For some 1,043 of the sites tested, we confirmed that DNS servers in China report a web server other than the official web server actually designated via each site's authoritative name servers. We call this phenomenon "DNS redirection," though others sometimes refer to the situation as "DNS hijacking." Consistent with prior reporting by Dynamic Internet Technology, our data show that such sites were consistently unreachable in their entirety.
Currently, when a user in China requests a site affected by DNS redirection, the user's computer is told that the site's domain name is associated with the IP address 64.33.88.161. That IP address is associated with the host www.falundafa.ca, the site of a Canadian organization that promotes the practice of Falun Gong. However, that address is itself blocked by Chinese border routers, preventing such requests from reaching either the falundafa server or any other. As a result, Chinese users are unable to reach the entirety of these many sites, including their respective default pages as well as their subsidiary pages.
While the authors cannot know for sure the specific rationale of Chinese network staff for implementing this additional method of filtering, we suggest two possible understandings. First, this method of filtering might be intended to supplement border router filtering; depending on the specific method of implementation, it might be in some way more efficient or more easily updated by Chinese network staff, and compliance of ISPs can be more easily monitored remotely via ordinary DNS tools such as dig. Second, this method of filtering is a likely precursor to efforts both to monitor accesses to specific sites and to revise or replace content on those sites with other content specifically provided by Chinese network staff; either approach would rely on proxy servers placed at specified IP addresses and would require that requests for designated sites in some way be redirected to those addresses. While this second theory is largely speculative, it rings true given related efforts to replace Google (see the authors' prior Replacement of Google with Alternative Search Systems in China) and subsequent filtering of certain Google search terms (including the names of key political figures and the terms required to use the Google cache).
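
Detecting DNS redirection amounts to comparing the addresses returned by a resolver in China with those designated by a site's authoritative name servers. The sketch below assumes both answer sets have already been collected (for example with a tool such as dig) and only performs the comparison. The function name and the result labels are hypothetical; the one constant is the address the text reports for redirected domains.

```python
# Address reported in the text as the answer given for redirected domains.
KNOWN_REDIRECT_TARGET = "64.33.88.161"


def classify_dns_answer(local_answers, authoritative_answers):
    """Compare a resolver's answers with the authoritative answers.

    Both arguments are iterables of IP address strings. The returned labels
    are illustrative, not terminology from the report.
    """
    local = set(local_answers)
    authoritative = set(authoritative_answers)
    if not local:
        return "no-answer"
    if local & authoritative:
        return "consistent"
    if KNOWN_REDIRECT_TARGET in local:
        return "redirected-known-target"
    # A mismatch alone is only suggestive: some sites legitimately return
    # different addresses to different resolvers.
    return "mismatch"
```

A "mismatch" result would need repeated confirmation before being counted as redirection, whereas an answer pointing at the known target address is much stronger evidence.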

Independent Filtering Implementations and Corresponding Circumvention Techniques
We have observed certain idiosyncrasies in Chinese methods of Internet filtering, and in some instances we have found methods to circumvent particular aspects of filtering. Based on this data, we can draw inferences about particular methods of filtering. In this section, we detail these anomalies as well as their implications.
  • Filtering on the basis of web server IP address. As described above, we were able to confirm that filtering was on the basis of IP address by observing that when China blocked access to one web site on a given physical server, all other sites on that physical server (i.e. on that IP address) were also typically blocked.
    • Implementation method: This method of filtering likely relies on block lists loaded into border routers that connect China's internal networks with international networks. ISPs reportedly share block lists, perhaps with additional centralized coordination of updates. Variation across networks and over time is to be expected based on delays in propagation of list revisions. Our data suggest that when Chinese network staff deem a site to contain undesirable content, their most common method of filtering it is simply to drop IP packets destined for it.
    • Circumvention methods: This method of blocking, the most widely-used in our experience, is difficult to circumvent. The typical circumvention method relies on channeling Web page requests and viewing associated results through proxy servers which are themselves outside China. However, monitoring and proxy-blocking efforts reportedly provide a check on the use of proxies. See details in Bennett Haselton's List of possible weaknesses in systems to circumvent Internet censorship and Seth Finkelstein's discussion of filtering "loopholes." When Google's cache feature was available in China, it allowed circumvention of this method of filtering, but this feature has since become unavailable, as described below.
  • Filtering on the basis of domain name server IP address. Like filtering on the basis of web server IP address, this method likely relies on block lists loaded into border routers. Even if the desired web server is itself reachable, a user's computer cannot reach the web server if it cannot first convert the site's domain name into a numeric IP address -- and when the site's DNS server is blocked, no such conversion is possible.
    • Apparent unintentionality of blocking: We have observed that many of the filtered DNS servers are also themselves web servers, or are located on networks that are filtered in totality (as distinguished from networks filtered only in part, i.e. for which certain specific IP addresses are filtered while others remain accessible). This lends some support to the inference that filtering at the level of DNS may be unintentional -- an accidental consequence of filtering a web server or network that also happens to offer domain name services.
    • Circumvention methods: When filtering operates on the basis of domain name server IP address, filtering can sometimes be circumvented via direct entry of the desired web server's IP address. In particular, an interested user may simply enter the IP address of the desired web server directly into a browser's Location bar (into the same location where the site's domain name would ordinarily be placed). Of course, this method requires that the user know the server's IP address (which the user cannot obtain directly through the ordinary domain name system since the domain's DNS server is, by hypothesis, blocked), and it further requires that the server provide only this single site (rather than hosting many sites via HTTP multiplexing). Nonetheless, in some situations entering an IP address directly may prove able to circumvent Chinese filtering efforts. An additional possible method of circumvention is the use of non-Chinese DNS servers, with such servers performing a subset of the role that an overseas proxy would serve to circumvent web host IP blocking. If such an approach became widespread, border routers could be reconfigured to refuse outbound DNS requests except when received from authorized DNS servers.
  • DNS redirection. As described above, DNS servers in China have been found to offer incorrect answers as to the IP addresses of certain domain names.
    • Circumvention methods: Use of non-Chinese DNS servers bypasses this method of filtering, though such use might in the future be blocked by border routers.
  • Filtering on the basis of keywords in URL. Beginning in September 2002, our data reflect that when a subscriber to a Chinese ISP submitted a URL request that itself contained certain words or phrases -- this typically happened for search engine searches, like http://www.google.com/search?q=jiang+zemin -- no response would be received. This effect was particularly notable at Google, where names of key political figures apparently came to be off-limits, as did certain other words used to invoke controversial Google features (among them the caching feature that can allow Google to be a method of circumventing the filtering implementations described above). In some instances, the authors have also observed that these keyword blocks may apply equally to requests for URLs on other sites; from at least certain locations in China, an attempt to retrieve any URL containing the character string "jiang+zemin" triggers filtering (even if the result of that request would only be a 404 Not Found error page).
    • Additional symptoms noted: Subsequent to a request for a URL with a prohibited term, the authors have received reports of (and have confirmed) "timeout" periods of 5 to 30 minutes during which either the target site or even all sites (including otherwise-permissible sites) became inaccessible. The authors have received further reports that some timeout periods may last until a user's computer is rebooted and/or until a user's DSL modem is powercycled. If intentional, as seems likely, this represents a type of filtering that tries to "train" the end user to avoid using prohibited terms, imposing a penalty beyond inaccessibility of the requested URL should the terms be used.
    • Implementation method: This method of filtering is likely implemented via packet-filtering systems integrated into border routers or placed adjacent to them. See additional discussion below.
    • Circumvention methods: We have observed that keyword-based filtering systems tend to search for plaintext in URL strings -- searching for the word "cache," for example, and blocking any request to google.com that contains this word in its URL. However, the URI specification describes additional techniques for encoding ("escaping") characters in a URL (RFC 2396 section 2.4.1). For example, ASCII characters can be encoded via escape sequences of the form %4A, where 4A is the hexadecimal code of the ASCII character at issue. The authors have confirmed that in at least some instances, Chinese filtering systems of the sort described in this section are not currently triggered by escaped forms of keywords that, when expressed in plain text, consistently prevent access to the requested pages. (This behavior reflects a failure to properly implement the URI comparison specified in RFC 2616 section 3.2.3.)
  • Filtering on the basis of keywords or phrases in HTML response. Beginning in September 2002, the authors observed that certain keywords in HTML response pages seemed to be blocked by Chinese network infrastructure. In particular, even when a page came from a server not otherwise filtered, and even when the page featured a URL without controversial search terms, it might nonetheless be inaccessible if the page itself contained particular controversial terms. Such pages were often truncated, i.e. interrupted midway through their display. On certain browsers, including recent versions of Microsoft Internet Explorer, pages truncated in this way may flash briefly on screen, then disappear. This phenomenon represents an augmentation of "compiled" filtering with "interpreted" filtering -- the former representing specific sites deemed ex ante to be off-limits, with routers configured accordingly, and the latter representing data deemed on-the-fly, mechanically, to be off-limits, with corresponding temporary loss of access to the source of that data.
    • Level of accuracy: The authors have observed that filtering on the basis of keywords in HTML responses sometimes seemed to malfunction, i.e. to allow passage and viewing of a page that contained words that were otherwise prohibited. These failures seemed to be random, but in some instances seemed to take place as often as not.
    • Implementation method: The observed results are precisely what would be expected if Chinese border routers (or associated hardware) implemented a packet-filtering system triggered by particular controversial keywords. To reduce memory and processor requirements, such systems promptly pass on all packets found to be acceptable. However, upon the receipt of the first packet containing a prohibited term, a packet-filtering system would be configured to discard all further packets from the same source and/or destination for some designated period -- causing the page truncation consistently observed under these circumstances. The randomness in successful filtering might reflect that packet filtering operates at less than line speed, i.e. is able to inspect only a portion of content passing through a given router. It might also reflect that packet filtering fails to take account of borders between packets, such that a page is permitted to be viewed if a part of a prohibited word is received in one packet and the remainder in a subsequent packet.
    • Additional symptoms noted: Timeout periods, as described above.
    • Circumvention method: Based on our understanding of the likely implementation method of such filtering, the authors note two possible means of circumventing this filtering. First, content providers can escape their text, using HTML markup that is equivalent to the characters at issue or adding HTML whitespace (comment tags, etc.) in the middle of controversial words or phrases. (These techniques are as documented in HTML specifications for character entity references and comments.) Second, Chinese users can reduce their TCP/IP stack's specified maximum transmission unit (MTU) -- reducing the amount of text contained in a given packet and thereby reducing the effectiveness of packet-inspection systems; however, this approach typically reduces performance and also increases network overhead.
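
The packet-boundary explanation offered above for response filtering, and the MTU-based circumvention that follows from it, can be illustrated with a toy model. This is a sketch under strong assumptions, not a description of any actual filtering hardware: it treats a "packet" as a fixed-size slice of the payload, assumes the filter inspects each packet in isolation, and uses made-up function names and a placeholder keyword.

```python
def split_into_packets(payload: bytes, size: int):
    """Cut a payload into fixed-size chunks (a stand-in for MTU-limited packets)."""
    return [payload[i:i + size] for i in range(0, len(payload), size)]


def first_matching_packet(packets, keyword: bytes):
    """Return the index of the first packet containing the keyword, else None.

    Each packet is inspected on its own, so a keyword that straddles two
    packets is never seen (the assumed weakness).
    """
    for index, packet in enumerate(packets):
        if keyword in packet:
            return index
    return None


def delivered(packets, keyword: bytes):
    """Packets that get through: everything before the first match.

    Models the truncation described in the text, where the first offending
    packet and all later ones are discarded.
    """
    index = first_matching_packet(packets, keyword)
    return packets if index is None else packets[:index]


if __name__ == "__main__":
    page = b"x" * 20 + b"forbidden" + b"y" * 20
    for size in (16, 8):
        got = b"".join(delivered(split_into_packets(page, size), b"forbidden"))
        print(size, len(got), len(page))
```

In this model, 16-byte packets keep the keyword whole and the page is truncated after the first packet, while 8-byte packets happen to split the keyword and the whole page passes. A smaller packet size only raises the chance of such a split; it does not guarantee one.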
These final two methods of filtering -- on the basis of keywords in URLs and HTML responses -- are not the primary focus of our reporting. Instead, our current work focuses on web sites filtered in their entirety; in future work, we will seek to document the specific keywords found to be prohibited in searches, URLs, and HTML response pages, and more important, the evolving prevalence of each type of filtering.
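
The URL-escaping circumvention described in this section rests on the difference between comparing raw URL strings and comparing their decoded forms. The sketch below contrasts the two using Python's standard `urllib.parse.unquote`. The function names and the example URL are hypothetical, and the "naive" matcher is only an assumed model of the observed behavior.

```python
from urllib.parse import unquote


def escape_all(text: str) -> str:
    """Percent-encode every ASCII character, e.g. 'J' -> '%4A'."""
    return "".join(f"%{byte:02X}" for byte in text.encode("ascii"))


def naive_match(url: str, keyword: str) -> bool:
    """Plaintext search of the raw URL (the behavior the text reports)."""
    return keyword in url


def normalized_match(url: str, keyword: str) -> bool:
    """Search after decoding escape sequences, so equivalent URLs compare alike."""
    return keyword in unquote(url)


if __name__ == "__main__":
    plain = "http://www.example.com/search?q=cache"
    escaped = "http://www.example.com/search?q=" + escape_all("cache")
    for url in (plain, escaped):
        print(url, naive_match(url, "cache"), normalized_match(url, "cache"))
```

A filter that decoded escape sequences before matching, as `normalized_match` does, would catch both forms of the URL; the plaintext matcher misses the escaped one.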

Other Effects of Chinese Filtering: Routing and Email
Routing. The authors have observed that some American ISPs route packets through China towards destinations beyond China (in particular, to Hong Kong). When the desired web servers are blocked from China, such routing typically subjects an American user's request to filtering by network equipment in China. Affected American ISPs can address the situation by manually altering the routes used to reach hosts in Hong Kong and elsewhere. However, affected ISPs are often unaware of the situation, and an effective response entails delay and/or additional expense as an affected ISP finds the necessary partner ISPs and establishes peering relationships with them.
Email. When border routers in China discard packets destined to or received from certain hosts, we understand that they typically do so without regard for the specified protocol of communications. As a result, email messages are typically filtered when sent to or received from blocked sites. The authors understand that additional filtering efforts may specifically target certain controversial emails, and the authors plan to document this situation in detail in future work.
Other Protocols: Filtering on the basis of server IP address can restrict additional protocols of Internet communications. For example, FTP is as affected as the web by blocking of a requested server's IP address. The authors have also received reports of failures of instant messaging software, likely reflecting difficulty in passing packets to and from designated servers.

FROM https://cyber.harvard.edu/filtering/china/appendix-tech.html
(https://cyber.harvard.edu/filtering/china/)
