ATGeo WG: How do we represent place names?

Continuing the discussion from ATGeo WG: Let's look at an example `place` object?:

On the subject of specifying names[1] in place objects:

This seems reasonable when a place has the same name in multiple languages! I figure that lang(s) and priority should be optional anyway.

Conceivably, although you get into potentially weird situations where a place name might be different in different languages but each name might have the same “priority” for cultural reasons. This might be a place where it’s simpler to just have the convention to use the first name in the array that matches your chosen language and use case.
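To make the "first match wins" convention concrete, here is a minimal TypeScript sketch. The `NameEntry` shape and the `firstNameFor` helper are hypothetical (nothing here is in a Lexicon yet), and falling back to an untagged name is my own assumption about what "optional `lang`" would imply.

```typescript
// Hypothetical shape for one entry in a place's `names` array.
interface NameEntry {
  text: string;
  lang?: string[]; // optional: language tags this name applies to
}

// Return the first name tagged with `lang`; if none is tagged with it,
// fall back to the first name that carries no language tag at all (assumption).
function firstNameFor(names: NameEntry[], lang: string): string | undefined {
  for (const entry of names) {
    if (entry.lang && entry.lang.indexOf(lang) !== -1) return entry.text;
  }
  for (const entry of names) {
    if (!entry.lang || entry.lang.length === 0) return entry.text;
  }
  return undefined;
}

// Illustrative data only, reusing names that come up in this thread.
const sampleNames: NameEntry[] = [
  { text: "New York", lang: ["en"] },
  { text: "Big Apple", lang: ["en"] },
  { text: "Nueva York", lang: ["es"] },
  { text: "NYC" },
];

console.log(firstNameFor(sampleNames, "en")); // "New York"
console.log(firstNameFor(sampleNames, "fr")); // "NYC" (untagged fallback)
```

Note that the result depends entirely on array order, which is exactly the property being debated here.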

Interesting idea! I couldn’t find a description of ATProtocol or Lexicon language objects with a superficial Google search, but I did find this in the Lexicon docs[2]:

Language codes generally need to be parsed, normalized, and matched semantically, not simply string-compared.

I’m not sure I fully understand the implications of this statement, but wouldn’t it complicate matters to stuff the language tag into an object key?
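For what it's worth, here is a rough sketch of why plain string comparison falls short. It is a deliberate simplification: it only lowercases the tag and compares subtags as a prefix, in the spirit of RFC 4647 basic filtering, and it is not a full BCP 47 implementation. The function names are made up for illustration.

```typescript
// Simplified normalization: lowercase and split into subtags.
// Real language-tag handling (canonical forms, deprecated subtags,
// extensions) needs a proper library; this only shows the idea.
function subtags(tag: string): string[] {
  return tag.trim().toLowerCase().split("-");
}

// True when `candidate` equals `range` or is a more specific version of it,
// e.g. range "en" matches candidate "en-US" but not "eng".
function matchesLanguage(range: string, candidate: string): boolean {
  const r = subtags(range);
  const c = subtags(candidate);
  if (r.length > c.length) return false;
  for (let i = 0; i < r.length; i++) {
    if (r[i] !== c[i]) return false;
  }
  return true;
}

console.log("en" === "en-US");                 // false: naive comparison misses it
console.log(matchesLanguage("en", "en-US"));   // true
console.log(matchesLanguage("EN", "en"));      // true: case-insensitive
console.log(matchesLanguage("en-US", "en"));   // false: candidate is less specific
```

If the tag lives in an object key such as `"lang:en"`, a direct lookup like `names["lang:en"]` behaves like the naive comparison on the first line: it would miss `"lang:en-US"` or `"lang:EN"` unless the client walks all keys and matches each one.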

If we go with this idea, and we bear in mind that a place could have multiple names in the same language[3], what would a more complex names array look like in practice? Could you give a more expansive example?


  1. ↩︎

  2. ↩︎

  3. ↩︎

Not too important, but I assume the convention exists because adding more languages to the schema later would not be a breaking change, and it tells old code that it should account for that possibility in the future. This is something I like in API designs, so I’d be in favor of keeping it like that even if we want to ship a single language right now.

One way to do this would be

{
  "names" : [
     {
        "text" : "San Francisco City Hall",
        "lang": ["en", "it", "ru"],
        "priorities": {
          "en": 0,
          "ru": 0,
          "it": 3
        }
     }
  ]
}

Although if there are standards for how to represent this, as @ngerakines.me suggests, we should consider them. I don’t know how far those go, but you could also do

{
  "names" : [
     {
        "lang:en": {
          "canonical": "San Francisco City Hall",
          "all": [
             {"name": "San Francisco City Hall", "priority": 0},
             {"name": "alternative name", "priority": 1},
             // ....
          ]
        }
     }
  ]
}

It depends on how much you want to bake ergonomics into the Lexicon design vs. rely on libraries offering better typing for each language’s conventions. For example, I would hate it if the way to find the most popular name in TS was:

names["lang:en"].find(option => option.priority == 0)

vs

// get most popular in language
names["lang:en"].canonical
// do operation that considers all priorities
names["lang:en"].all.sort(/* sorting function */)

But note that this design can introduce inconsistencies between the priority of a “canonical” and “all” (e.g. a bad record can have canonical differ from priority: 0), so we’d have to account for that… and so would any client consuming these. (edit: fairly easy by having rest instead of all)
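A sketch of the check a client would need if we kept both `canonical` and `all`. The type and function names are hypothetical, and I'm assuming "lower number means higher priority", as in the examples above.

```typescript
interface RankedName {
  name: string;
  priority: number;
}

// Hypothetical per-language entry with both fields present.
interface LangNames {
  canonical: string;
  all: RankedName[];
}

// True when `canonical` is the entry with the lowest priority number in `all`.
// Ties are resolved in favor of the earliest entry.
function isConsistent(entry: LangNames): boolean {
  let best: RankedName | undefined;
  for (const option of entry.all) {
    if (!best || option.priority < best.priority) best = option;
  }
  return best !== undefined && best.name === entry.canonical;
}

const good: LangNames = {
  canonical: "San Francisco City Hall",
  all: [
    { name: "San Francisco City Hall", priority: 0 },
    { name: "alternative name", priority: 1 },
  ],
};

// A "bad record": canonical disagrees with the priority: 0 entry.
const bad: LangNames = {
  canonical: "alternative name",
  all: good.all,
};

console.log(isConsistent(good)); // true
console.log(isConsistent(bad));  // false
```

With `rest` instead of `all`, the canonical name is never repeated in the list, so this class of inconsistency cannot be expressed in the first place and the check goes away.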

Still:

  1. I don’t know how these names are accessed in real programming scenarios, so it’s a bit hard to balance concerns.
  2. I can imagine convenience libraries built on top if we want to keep the lexicon simpler/more “mistake proof” and still let people have a good time doing common operations.

The main thing for me is we should absolutely not rely on convention, as much as possible. In a decentralized environment we don’t have a lot of control on whether people will read a spec or do research before just thoughtlessly putting data in whatever order.

Adding one more thing. Our preferred representation also depends on how the names are accessed, if we care about ergonomics.

The above makes it easy to go [language] => [details of name in language], while the initial design makes it so the easiest thing to do is [place name] => [info about that name]. It feels to me like the first one would be the most common case, but I defer to those that know more about the practice.
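To illustrate the difference, here is a small sketch of what a client or convenience library would have to do to get from the name-first layout to a language-first lookup. `TaggedName` and `indexByLanguage` are names I made up; the input shape is loosely based on the first example in this post.

```typescript
// Name-first layout: each name lists the languages it applies to.
interface TaggedName {
  text: string;
  lang: string[];
}

// Build a language-first index: language tag => names, in original order.
function indexByLanguage(names: TaggedName[]): Record<string, string[]> {
  const byLang: Record<string, string[]> = {};
  for (const entry of names) {
    for (const tag of entry.lang) {
      const list = byLang[tag] || (byLang[tag] = []);
      list.push(entry.text);
    }
  }
  return byLang;
}

const byLang = indexByLanguage([
  { text: "San Francisco City Hall", lang: ["en", "it", "ru"] },
  { text: "alternative name", lang: ["en"] },
]);

console.log(byLang["en"]); // ["San Francisco City Hall", "alternative name"]
console.log(byLang["it"]); // ["San Francisco City Hall"]
```

If the language-first lookup really is the common case, every consumer ends up writing something like this, which is an argument for putting that shape in the record itself.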

I think this last example is getting closer to something that balances completeness with extensibility with ergonomics.

For the sake of adding some additional perspectives, I submit the OpenStreetMap wiki page on Names, which offers a folksonomic approach to toponymy.

Some relevant examples include:

  • name: the default or primary name by which the place is known
  • name:xx: the primary name for language code xx (e.g. name:en)
  • official_name[:xx]: the “official” name of the place when it differs from the commonly used name
  • alt_name[:xx]: an alternative name of some kind (possibly colloquial or historical)
  • old_name[:xx]: an alternate name of primarily historical significance
  • short_name[:xx]: an alternate name used as an abbreviation (e.g. NYC for New York City)
  • loc_name[:xx]: an alternate name that is purely colloquial or even slang (e.g. “San Fran” for San Francisco)

Aaron Cope in his Who’s On First project also advocates for a “base” name, which is a fallback 7-bit ASCII transliteration that can be used when all else fails.

I’m not proposing that we capture all of these use cases necessarily, but it does give some idea of the range of possibilities. The alternate names really come into play when you want to geocode something, and it matters to your use case to be able to meaningfully match “The Big Apple” to “New York City”.
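As a rough illustration of the geocoding case, here is a sketch that checks a query against every name-like tag on an OSM-style tag set. The helper names are hypothetical and the tag values are illustrative, not copied from real OSM data. It assumes the usual OSM convention of separating multiple values in one tag with a semicolon.

```typescript
type OsmTags = Record<string, string>;

// The name keys listed above; each may carry a ":xx" language suffix.
const NAME_KEYS = ["name", "official_name", "alt_name", "old_name", "short_name", "loc_name"];

function isNameKey(key: string): boolean {
  const base = key.split(":")[0] || key;
  return NAME_KEYS.indexOf(base) !== -1;
}

// Case-insensitive exact match of `query` against any name value.
function matchesAnyName(tags: OsmTags, query: string): boolean {
  const wanted = query.trim().toLowerCase();
  for (const key of Object.keys(tags)) {
    if (!isNameKey(key)) continue;
    const raw = tags[key] || "";
    for (const value of raw.split(";")) {
      if (value.trim().toLowerCase() === wanted) return true;
    }
  }
  return false;
}

const nyc: OsmTags = {
  "name": "New York City",
  "name:es": "Nueva York",
  "alt_name": "The Big Apple",
  "short_name": "NYC",
  "population": "not a name",
};

console.log(matchesAnyName(nyc, "the big apple")); // true
console.log(matchesAnyName(nyc, "Nueva York"));    // true
console.log(matchesAnyName(nyc, "not a name"));    // false
```

A real geocoder would do fuzzy matching and ranking on top of this, but even the exact-match version only works if the alternates are stored somewhere predictable.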

So it seems like we want to readily capture “primary” versus “alternates”, and we want to capture language variations.

There’s also the use case of “I just want the normative name of the place in whatever language(s) the locals use”.

That’s probably good enough – I don’t think the goal is to build a database of the world here. That’s OSM’s job. But I think we should support a range of geocoding and rendering use cases, and organize the name properties in a way that is both ergonomic and hard to mess up.

Also, Overture Maps has its own abstraction of the OSM naming system with an explicit mapping between them. Here’s their example:

{
    primary: "New York",
    common: {
        "br": "Evrog Nevez",
        "el": "Νέα Υόρκη",
        "es": "Nueva York",
        "be-Latn-tarask": "Нью-Ёрк"
    },
    rules: [
        {
            "value": "City of New York",
            "variant": "official",
            "language": null,
        },
        {
            "value": "Nueva Ámsterdam",
            "variant": "alternate",
            "language": "es",
        },
        {
            "value": "Big Apple",
            "variant": "alternate",
            "language": null,
        },
        {
            "value": "La Gran Manzana",
            "variant": "alternate",
            "language": "es",
        },
    ]
}

I’m not suggesting we specifically adopt this model, but it does seem to have most of the desirable properties we’ve discussed in this thread.
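To get a feel for the ergonomics, here is a TypeScript sketch of that shape with two lookups. The interfaces are my reading of the example above, not Overture's official schema, and `displayName` and `variantsFor` are hypothetical helpers. I'm deliberately not assigning any meaning to `language: null`; callers ask for it explicitly.

```typescript
interface NameRule {
  value: string;
  variant: string;
  language: string | null;
}

interface PlaceNames {
  primary: string;
  common: Record<string, string>;
  rules: NameRule[];
}

// Common name in `lang` if there is one, otherwise the primary name.
function displayName(names: PlaceNames, lang: string): string {
  return names.common[lang] || names.primary;
}

// All rule values whose `language` is exactly `lang` (pass null for untagged rules).
function variantsFor(names: PlaceNames, lang: string | null): string[] {
  return names.rules.filter((r) => r.language === lang).map((r) => r.value);
}

const newYork: PlaceNames = {
  primary: "New York",
  common: { br: "Evrog Nevez", es: "Nueva York" },
  rules: [
    { value: "City of New York", variant: "official", language: null },
    { value: "Nueva Ámsterdam", variant: "alternate", language: "es" },
    { value: "Big Apple", variant: "alternate", language: null },
    { value: "La Gran Manzana", variant: "alternate", language: "es" },
  ],
};

console.log(displayName(newYork, "es")); // "Nueva York"
console.log(displayName(newYork, "fr")); // "New York"
console.log(variantsFor(newYork, "es")); // ["Nueva Ámsterdam", "La Gran Manzana"]
```

The lookups here use exact string keys, so the earlier caveat about matching language tags semantically applies to `common` as well.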

Oh, interesting! It does indeed seem to account for those cases, although I’m unsure about there being a primary in every case (would that be the name of the place in the official language of the place? Is there always one?).

Personally, and this is a developer perspective, I’d still like to have things split by language, which might or might not make sense.

Getting back to your example, it would be:

{
  "names" : [
     {
        "primary": "New York",
        "lang:en": {
          "common": "New York",
          "variants": [
             {"name": "City of New York", "type": "official", "priority": 0},
             // This originally has null in language, but I'm using it as an example
             {"name": "Big Apple", "type": "alternate", "priority": 1},
             // ....
          ]
        },
        "lang:es": {
          "common": "Nueva York",
          "variants": [
             {"name": "Nueva Ámsterdam", "type": "alternate", "priority": 0},
             {"name": "La Gran Manzana", "type": "alternate", "priority": 1},
             // ....
          ]
        },
        "lang:br": {
          "common": "Evrog Nevez",
          // None that we know of, array is empty
          "variants": []
        }
     }
  ]
}

What I’m unsure about:

  1. The concept of primary, as mentioned
  2. The significance of null in the language field
  3. What priority means when you have e.g. type official and type alternate in the same variants array.