Distributing Content

One of the really interesting things that I heard at FUDCon was loupgaroublond’s thoughts on distributing content.  Content has been a bit of a hot topic for Fedora recently.  In the past month I’ve seen mailing list and IRC discussions on:

  • Whether rpm packages of the books on Project Gutenberg would be good for Fedora
  • How to decide whether someone’s random pictures of London deserve to be packaged in an rpm that people can download from Fedora as wallpapers
  • How best to manage (whether as rpm packages or not) the documentation that the Fedora Docs project produces, given that translations and the Fedora release each document targets can cause an explosion of documents.

These are all attempts to manage content using rpm, software which was originally meant for packaging software.  There are several advantages to using the package manager:

  • It lets the system administrator easily see that certain packages of content are installed on the system via the tool that they already know.
  • Having it managed this way means that individual account holders don’t all have to download separate copies.  You can have a single copy of the files on the computer.

However, there are also many disadvantages, including:

  • How do we avoid conflicts in package names?  This is especially prevalent with content like “london-pictures”.
  • How do we decide where to break up content?  Should my pictures of London and your pictures of London be forced into the same package or separate?  At one extreme we end up generating huge packages for users to download; at the other we explode the number of packages (and hence the metadata associated with the repositories).
  • How do we provide people with the ability to search and browse the content?  If you compare deviantart, KDE Look, or flickr to the Fedora Package List you’ll see that the process of looking for content in the package list is definitely suboptimal.

What loupgaroublond is proposing as a GSoC project is to make an alternate way of hosting, delivering, and managing content that does not involve the packaging formats that we use for software. In the comments to loupgaroublond’s post, Kevin Kofler links to the Open Collaboration Services API that’s being used by opendesktop.org.  This looks like a good place to start for the delivery portion of the equation.  Since opendesktop drives gnome-look.org, kde-look.org, and others, they probably have some of the hosting equation worked out as well.  (Although I have a feeling that we could bring a lot to the table wrt managing mirror networks of the content via mirrormanager and other things we’ve developed purely for hosting the software that makes up Fedora).

However, that still leaves us needing to figure out how to manage the content once it’s on the computer.  This consists of at least two parts:

  1. A standard for how to organize and find the content on the system.  This probably includes a set place in the filesystem for installing content systemwide and another spot for installing content in the user’s home directory.  It would also include metadata that could be associated with the content and some standard APIs for accessing the information.
  2. A method of tracking what things are installed on the system.  loupgaroublond hints about this when he talks about a “content database system”.  For software packages in Fedora, rpm keeps a database in /var/lib/rpm that tracks what software is installed, metadata about the software, and where on the filesystem it lives.  We’d need something similar for tracking content with the addition that it needs to merge with tracking information stored in the user’s home directory.
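
As a rough illustration of the second point, here is a minimal Python sketch of merging a systemwide content index with one kept in the user’s home directory.  Everything in it is an assumption made for illustration: the two index paths, the JSON layout, and the function names (load_index, merged_index, search, demo) are hypothetical, not an existing standard and not anything loupgaroublond has specified.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical locations; neither path is an existing standard.
SYSTEM_INDEX = Path("/usr/share/content/index.json")
USER_INDEX = Path.home() / ".local/share/content/index.json"


def load_index(path):
    """Return {content_id: metadata} from a JSON index, or {} if it is absent."""
    path = Path(path)
    if not path.exists():
        return {}
    with path.open(encoding="utf-8") as handle:
        return json.load(handle)


def merged_index(system_path=SYSTEM_INDEX, user_path=USER_INDEX):
    """Combine both indexes, tagging each entry with the scope it came from."""
    merged = {}
    for scope, path in (("system", system_path), ("user", user_path)):
        for content_id, metadata in load_index(path).items():
            # A per-user entry replaces a systemwide entry with the same id.
            merged[content_id] = dict(metadata, scope=scope)
    return merged


def search(index, tag):
    """Return the sorted ids of entries whose metadata lists the given tag."""
    return sorted(cid for cid, meta in index.items() if tag in meta.get("tags", []))


def demo():
    """Build two throwaway indexes and return the merged view."""
    with tempfile.TemporaryDirectory() as tmp:
        system_path = Path(tmp) / "system.json"
        user_path = Path(tmp) / "user.json"
        system_path.write_text(json.dumps({
            "a1": {"title": "Tower Bridge", "tags": ["london", "wallpaper"]},
            "b2": {"title": "Alice in Wonderland", "tags": ["book"]},
        }), encoding="utf-8")
        user_path.write_text(json.dumps({
            "c3": {"title": "Big Ben at night", "tags": ["london"]},
        }), encoding="utf-8")
        return merged_index(system_path, user_path)
```

Calling search(demo(), "london") looks across both scopes at once.  The ids here are deliberately opaque; the title and tags in the metadata are what a person would actually search on.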

Figuring out answers to these problems would be a worthy summer project that would benefit a broad spectrum of projects.  Unix-like distributions would gain a way to deliver a hoard of free content.  Content authors would find a larger audience for their work.  Programmers would be provided with a means of sharing content resources on the system with each other and an API to access the content locally and possibly remotely.  System administrators would be able to manage free content similarly to how they manage free software on their systems.  End users should gain access to content from a consistent interface rather than having to traverse the internet, searching a variety of different websites with different UIs for what they’re looking for.

Any takers?  Please contact loupgaroublond using one of his preferred contact methods.


9 thoughts on “Distributing Content”

  1. Pingback: Toshio Kuratomi: Distributing Content | TuxWire : The Linux Blog

  2. I still think this approach is totally misguided. It assumes without checking that the limiting factor is the packaging work, and that by removing the packaging steps all will be yummy. However:

    1. People who have actually packaged “content” systematically report packaging was not the problem (legal is usually the problem)

    2. There is huge value in being directly available in the distribution. The massive success of Arial and Times New Roman is solely linked to direct Windows availability

    People do *not* like to be redirected to web shops or web libraries, and web libraries are only good at making available a lot of dubious (legal, security or quality-wise) content without checking. A web project is only going to get faster at integrating material if you skip all the non-technical checks that make sure packages are sound legally and quality-wise. May as well have a no-checks third-party rpm repository with the same properties and a lot less wasted effort.

    As a bonus web projects move content all the time so even if you managed to find something you liked in there good luck finding it again a month later. (and if you want reliable web content location, well, you hit the *same* namespace choices as packaging-side)

    3. “code” has the same presentation constraints as “content” (see the frantic manual generation of games screenshots for the spins website), so we’ll need to add a presentation layer to distro tools *anyway* and it will happen faster without diverting resources to a web library project (the Open Font Library started about the same time or even earlier than the Fedora fonts sig. It is still not completely operational, reliable, or sane today. And a lot more effort and funding was poured into it than into our own packaging efforts)

    But code packagers do not like content, and people feel a web library is some sort of shortcut. Beware of shortcuts! They’re a good way to get hurt or lost.

    • Note that I did not list the review process for packages as a problem with our current content-in-software-repository approach. Although, now that you mention it, it is another thing that should be addressed. Content has some of the same concerns as code (legality of distribution, filesystem placement, scripts that may need to be run to register that content has been installed) but there are also a lot of things that make content different. For instance, there might be multiple versions of a piece of content that people have released. Some people may want to have the first version whereas others may want to have a later one. There’s no obvious bugfix or feature-addition cycle that makes the later version more desirable. So it may be that a content repo should do a review on every submission, unlike the Fedora Package Review process, which lets software updates through without a review. OTOH, this also means that an upgrade path is not a necessity for a content distribution format. We want to be able to install all revisions of the content, not just one.

      Being directly available in the distribution is an interesting question. What constitutes being directly available? That the distribution has tools to search, manage, and install the content? A large number of the current content distributors are outside the distro because of two things: you have to go to them manually and you don’t know anything about the guarantees that the content makes (what are the allowable licenses? How do I make this into wallpaper for program Foo?) If Fedora ran a content repository or partnered with another upstream that had a review process for the content they hosted, and had the tools to install, manage, and search those repositories, what would still separate them from the “available directly” category?

      As for namespacing, a content repository will have namespacing issues but they’re different limitations than rpms. For instance, you can get away with naming all the pictures in a content repository with unique, incrementing integers or a hash. The metadata and a thumbnail of the picture itself are what identify them to the person searching the repository. With rpm, the rpm package name is an important piece of metadata in and of itself.

      • 1. review and editorial checks are the huge issue proponents of web libraries paper over. A web library will quickly degenerate into a pile of illegal files, spam comments and sometimes malware unless it is strongly moderated. (this is why dafont is completely useless from a legal POV)

        And if it is strongly moderated it will be as manpower-expensive as direct packaging (With the *additional* cost of needing to set up a different infrastructure. A web site is easy. A highly available web site that can cope with the load is not).

        If a web library is not sane and safe Fedora will refuse to link to it directly so the main objective (making stuff available easily) will be lost.

        2. No user cares about multiple versions of the same stuff, that’s over-engineering (or they care as much as multiple versions of the same code, with the same workarounds). The user view is very different from the author or packager view. The user view is it must just work without investing time (including without having to compare multiple versions). People who do checks like to have the history, but it’s a different kind of access.

        3. Unique uuids or hashes do not work. Because they’re “just an opaque cookie” they’re useless in search and you get a content soup without any structuring. Because they’re “just an opaque cookie” web developers like to change them randomly, making stuff move and users unable to find again what they liked a month earlier (this is a huge problem OFLB-side, for example)

        1. Once again, I’m not proposing that reviews should be abandoned
        2. With content, a new version is a new piece of content. Users do care about that.
        3. Search should be managed via metadata. Maintaining stable URLs should be a design goal, one that does not depend on whether the content has a human-readable name.
  3. Also note that “the uploader will self-check” does not work. It ignores malicious people. But even non-malicious people will just declare what the system wants to see to push the files they want to share, even if what the system wants to see is different from reality (I’ve seen *many* commercial fonts declared “free” or even “GPL” in web libraries for this reason).

    People willing to be clean legally won’t have any problems packaging. People who do not care about packaging do not care about investigating legal concerns either.
