

HTTP URIs do not resolve from NL and DE?
Closed, ResolvedPublicBUG REPORT

Description

Steps to replicate the issue (include links if applicable):

  • From my Ubuntu laptop in the Netherlands, I run the following curl command:
$ curl -v -I http://www.wikidata.org/entity/Q42

This request should be redirected to the https version with a possible redirection to a content page.

Instead it results in the following error:

*   Trying 91.198.174.192:80...
*   Trying 2620:0:862:ed1a::1:80...
* Immediate connect fail for 2620:0:862:ed1a::1: Network is unreachable

I get the same result when doing a GET request on https://reqbin.com/ with DE set for server location.

When switching the server location to US, it results in the expected behavior and comes back with the data.

The same happens when pasting the HTTP URL into my (Firefox) browser.


Event Timeline

Please use HTTPS rather than unencrypted HTTP, in any URIs referencing Wikimedia sites, e.g. https://www.wikidata.org/entity/Q42

Aklapper changed the task status from Open to Stalled.Mar 1 2023, 4:34 PM
Aklapper added a project: Traffic.

Cannot reproduce from Central Europe:

* Connected to www.wikidata.org (91.198.174.192) port 80 (#0)
[...]
HTTP/1.1 301 TLS Redirect

The intermittent availability of port 80 is part of ongoing operational work, which is why it worked briefly earlier. However, the correct fix from the user POV is to not use port 80 in the first place (use HTTPS, not HTTP).

As already mentioned by @BBlack, HTTPS is the way to contact Wikimedia sites.
I'm a little bit curious about your Firefox setup. Since wikidata.org is included in the HSTS preload list (see https://hstspreload.org/?domain=wikidata.org), your browser should internally redirect you to https://www.wikidata.org (Chrome flags this in the developer console with a 307 internal redirect response, but my Firefox 102.8.0esr doesn't show anything explicit about this; it just gets me the HTTPS version). Which Firefox version are you using?

Vgutierrez added a parent task: Restricted Task.Mar 1 2023, 4:45 PM

Thanks for the replies! Advising to use HTTPS over HTTP makes sense.

But not supporting redirection from HTTP to HTTPS will, in my opinion, introduce a fundamental problem for using Wikidata as a source for Linked Data. When querying Wikidata through the SPARQL endpoint, the entities in the result set are all HTTP URIs. The RDF descriptions of WD entities (accessed as described on https://www.wikidata.org/wiki/Wikidata:Data_access) contain many HTTP URIs for related entities and other resources.

Using the HTTP URI as the identifier for an entity is not problematic as long as the redirection from HTTP to HTTPS still delivers access to the data itself.
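A Linked Data client can keep the http:// form as the stable identifier while fetching over HTTPS. The sketch below illustrates this split, assuming a hypothetical allow-list of hosts (the host set and function name are mine, not an official Wikimedia inventory):

```python
from urllib.parse import urlsplit, urlunsplit

# Illustrative list of hosts we assume should always be fetched over HTTPS.
HTTPS_ONLY_HOSTS = {"wikidata.org", "www.wikidata.org"}

def upgrade_scheme(uri: str) -> str:
    """Return an https:// version of a http:// URI for known hosts.

    The http:// form stays valid as the *identifier*; only the scheme
    used for fetching the data changes.
    """
    parts = urlsplit(uri)
    if parts.scheme == "http" and parts.hostname in HTTPS_ONLY_HOSTS:
        return urlunsplit(("https",) + tuple(parts)[1:])
    return uri

print(upgrade_scheme("http://www.wikidata.org/entity/Q42"))
# -> https://www.wikidata.org/entity/Q42
```

URIs for hosts outside the allow-list pass through unchanged, so the rewrite never touches identifiers the client doesn't control.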

I've noticed in the past few days that when I enter "wikidata.org" on my phone (using Vivaldi), it's sometimes really slow to load, but will load straightaway if I edit the URL to add https://. I don't remember whether it would eventually load if I waited or not.


IMHO that's a gap that needs to be closed on the SPARQL endpoint, as including HTTP resources from an HTTPS resource is considered insecure: https://developer.mozilla.org/en-US/docs/Web/Security/Mixed_content.

That being said, on almost any recent browser, traffic against wikidata.org or any other Wikimedia canonical domain will target port 443 (HTTPS) due to HTTP Strict Transport Security, a mechanism that lets us ask browsers to contact any of our domains via HTTPS even if the link or the user tries to reach the site via HTTP.
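The HSTS mechanism boils down to a response header the server sends and the client remembers. As a rough sketch, parsing such a header (RFC 6797 syntax) could look like this; the function and dictionary keys are my own naming, not a standard API:

```python
# Minimal sketch of how a UA might parse a Strict-Transport-Security
# header (RFC 6797). The dict layout is illustrative, not any library's.

def parse_hsts(header: str) -> dict:
    policy = {"max_age": None, "include_subdomains": False}
    for directive in header.split(";"):
        directive = directive.strip()
        if directive.lower().startswith("max-age="):
            # max-age value may be quoted per the RFC's grammar.
            policy["max_age"] = int(directive.split("=", 1)[1].strip('"'))
        elif directive.lower() == "includesubdomains":
            policy["include_subdomains"] = True
    return policy

print(parse_hsts("max-age=106384710; includeSubDomains; preload"))
```

A conforming UA would then upgrade all plain-HTTP requests for that host to HTTPS until max-age seconds have elapsed.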

I think this touches upon a fundamental question of how to model WD information as Linked Data. As currently stated in https://www.wikidata.org/wiki/Wikidata:Data_access the concept URI of an entity is its HTTP version.

Accessing the data associated with the concept URI should be possible both for humans (through browsers) and for applications. Can you point me to examples of machine-readable processing using the HTTP Strict Transport Security implementation, or is this a browser-only solution?

Probably HSTS is only being implemented by browsers.
Is there any particular reason to target the HTTP version, or could it be bumped to HTTPS? Consider that we don't serve traffic over port 80 other than 301s for GET/HEAD requests and 403s for PUT/POST ones.

curl also implements HSTS (see https://curl.se/docs/hsts.html), but it is indeed primarily a mechanism to protect users of browsers.

@Ennomeijers you are right about this being a fundamental question. I think nobody expected HTTP to become disfavored anytime soon when that URI scheme was proposed, but then things happened, and here we are today, trying to serve only the absolutely necessary requests that @Vgutierrez mentioned above for maintaining compatibility, while urging everyone to migrate to HTTPS. I expect that this will cause more issues down the road, probably many of them inadvertent, like this one. But it's quite conceivable that plain HTTP will be less and less supported gradually in the future.


We don't have plans to get rid of port 80 HTTP->HTTPS redirects anytime soon. However, we consider that traffic pretty low priority, and in this particular case we partially disabled it temporarily while dealing with an operational incident, which luckily led to us uncovering this issue!

However, the canonical (i.e. "official", "should be used in all links") URIs for traffic/access to all Wikimedia project domains are HTTPS URIs, not HTTP ones. We shouldn't be publishing plain-HTTP URIs.

The HTTP->HTTPS redirects are designed to help smooth over issues with legacy links we don't control out in the wild Internet, when accessed by UAs that don't respect HSTS. These redirects are, by their nature, not secure. When users access content through plain-HTTP URIs, even though we try to serve a helpful 301 redirect to HTTPS, literally anyone on the network path between the user and the WMF can see and modify both the request and the response in flight.

The initial, insecure request via plain-HTTP can be censored, surveilled, and modified. This means individual resources can be blocked/censored/replaced by bad actors. The article names you're reading can be catalogued to build profiles on readers. The intended 301 redirect can be replaced with something completely different, such as a redirect to a different site, an alternative version of our content, or even an attack payload or banner ad injection.

All of that aside, HTTP access also performs worse: you have to do the full TCP and HTTP transaction (multiple latency roundtrips) just to get the redirect response, then start over with a fresh HTTPS transaction on a new TCP connection (more redundant network roundtrips). Normal redirects that stay within one protocol and domain name can generally re-use the same connection, but HTTP->HTTPS protocol upgrade redirects cannot.

For all of these reasons, for Traffic purposes, all canonical URIs for our projects are HTTPS, not HTTP. We hadn't been aware that anything wikidata-related was publishing canonical URIs that start with http://, and we're collectively going to need to find a way to stop doing that.


HSTS is basically a legacy transition mechanism, much like the redirects, but both stronger and less-universal. It's defined in https://www.rfc-editor.org/rfc/rfc6797 . Its goal is to help paper over issues exactly like these - the first time you access https://www.wikidata.org/<anything>, you get an extra header that informs the UA that all future accesses to this whole domain should be upgraded to HTTPS without attempting plain HTTP at all. Further, there's a public "HSTS Preload" list at https://hstspreload.org/ that all modern browsers utilize, and which contains all of our domains. This avoids the problem of first access and HSTS caching, so that Preloading UAs transform even the first HTTP access to HTTPS before sending anything over the network.

It's not specific to browsers; it's implemented as generic headers intended to be honored by any UA, but obviously many less-user-focused UAs (various HTTP library implementations for scripts, the curl CLI tool, etc.) do not necessarily implement it strongly.
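For a script UA that wants to honor HSTS the way browsers do, the caching behavior described above can be sketched in a few lines. This is an assumed, simplified in-memory cache (class and method names are mine, and it ignores includeSubDomains and persistence, which real implementations such as curl's handle):

```python
import time
from urllib.parse import urlsplit, urlunsplit

class HstsCache:
    """Toy in-memory HSTS cache for a non-browser client (illustrative)."""

    def __init__(self):
        self._known = {}  # host -> expiry timestamp (seconds since epoch)

    def note(self, host: str, max_age: int) -> None:
        """Record that `host` sent Strict-Transport-Security: max-age=..."""
        self._known[host] = time.time() + max_age

    def rewrite(self, uri: str) -> str:
        """Upgrade http:// to https:// for hosts with a live HSTS entry."""
        parts = urlsplit(uri)
        expiry = self._known.get(parts.hostname, 0)
        if parts.scheme == "http" and expiry > time.time():
            return urlunsplit(("https",) + tuple(parts)[1:])
        return uri

cache = HstsCache()
cache.note("www.wikidata.org", 106384710)
print(cache.rewrite("http://www.wikidata.org/entity/Q42"))
# -> https://www.wikidata.org/entity/Q42
```

A preload list, as discussed above, is essentially the same lookup seeded before the first request is ever sent.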


The SPARQL endpoint doesn't include anything over HTTP, so there's no mixed content; it only creates links.

Ennomeijers claimed this task.

As I already mentioned earlier, the SPARQL endpoint and the RDF-serialized data all use the HTTP version as the canonical identifier. This makes sense to me and is, as far as I know, in line with other Linked Data best practices. But there needs to be a machine-readable way to access the data.

Using a 301 to redirect to the HTTPS URL is the correct approach, and in fact this is already implemented and currently working again from my end. When I run the same command as in my first report, I now get a 301 reply. I hope this keeps working until HTTP URIs are no longer used within WD. I will close the issue for now.

BBlack reopened this task as Open.EditedMar 6 2023, 5:21 PM


Please don't close this task unless we're replacing it with a more-focused one on the uncovered issues here. For the reasons stated earlier, relying on the 301 to "fix" this is not the correct approach. We can open a separate new task if you prefer, but either way we need to get this properly addressed (by having all live links in our control use the proper canonical URIs via https://).

Ok, I see your point. As long as the concept/canonical URIs for all entities are published as http:// URIs, there is no other way than following the 301 redirect. The core of the problem is the current practice of publishing http:// URIs, either through the SPARQL endpoint or directly in the data (via an .rdf-like extension or content negotiation). The HTTPS-vs-HTTP problem has been a common issue in the Linked Data domain since the introduction of HTTPS (Schema.org suffers from similar problems).

Linked Data URIs are supposed to be stable over a long time. Just switching to HTTPS and blocking redirection can and will break numerous applications built on resolving WD entities. To address this issue more clearly, I would recommend opening a new issue requesting the switch to canonical HTTPS URIs. I don't know the backend and the community well enough to suggest an approach to solve this problem. Leaving this discussion hidden under the title of the current issue doesn't help its visibility.

The problem is identifiers vs. URLs. An identifier is stable. A URL might not be. If you start using locators as identifiers... things become gray.

Then again, the spec for HTML4 will forever be http://www.w3.org/TR/html4/strict.dtd, and that too just redirects to HTTPS if you actually request it. I don't see the redirects as being a bad thing in that regard.

The redirects are neither good nor bad; rather, they're both necessary (although that necessity is waning) and insecure. We thought we had standardized on all canonical URIs being of the secure variant ~8 years ago, and this oversight has flown under the radar since then, only to be exposed recently when we intentionally (for unrelated operational reasons) partially degraded our port 80 services.

I've made a new ticket, since that seems better all around. Let's move the rest of this discussion there.