Discussion:
[RFC] kbfiles: an extension to track binary files with less wasted bandwidth
Andrew Pritchard
2011-07-26 18:23:50 UTC
The goal of kbfiles is to maintain the benefit of version tracking for binary
files without requiring clones and pulls to download versions of large,
incompressible files that will likely never be needed. These files are
replaced, according to the user's configuration, with small standin files
containing only the SHA1 sum of the binary file. Mercurial then tracks these
standin files, keeping history small, while the binary files are retrieved
only as needed (when updating, for example).
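As a rough illustration of the standin idea (a minimal sketch, not the
extension's actual code), the following hashes a binary file and writes a
small text file holding only its SHA1 sum. The helper names sha1_of_file and
write_standin are hypothetical; the .kbf/ standin directory is the one
described later in this message.

```python
import hashlib
import os


def sha1_of_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so large binaries need not fit in memory."""
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def write_standin(repo_root, rel_path, standin_dir=".kbf"):
    """Write a standin for rel_path containing only the file's SHA1 sum.

    Hypothetical helper: the real extension may lay out standins differently.
    """
    digest = sha1_of_file(os.path.join(repo_root, rel_path))
    standin = os.path.join(repo_root, standin_dir, rel_path)
    os.makedirs(os.path.dirname(standin), exist_ok=True)
    with open(standin, "w") as f:
        f.write(digest + "\n")
    return standin
```

The standin is a few dozen bytes regardless of how large the binary is, which
is what keeps the tracked history small.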

The reasoning behind this is that binary files are frequently large and already
compressed as part of their format, and as such, compressed diffs don't work
very well to track their changes. Since it is common for many types of
software development (game development being a particularly strong example) to
have large volumes of binary assets, without an extension like kbfiles, clones
can end up being a single many-gigabyte transaction, whereas kbfiles allows
this to be split into smaller transactions and avoid transferring most of the
data altogether. Kbfiles also avoids diffing the binary files, transferring
them as they are in any given revision. Finally, the size of data stored
locally is greatly decreased for common use cases, in which old versions
of binary assets are not often needed.

The typical use case is to have these binary files available on a central
server, though retrieving bfiles from both SSH and HTTP Mercurial repositories
is supported in the wire protocol. There are three locations that will be
checked to find the required big files:
- The repository-local cache, in .hg/kilnbfiles (this will be changed as needed
with the name of the extension);
- The configurable system cache, defaulting to $HOME/.kilnbfiles on POSIX-y
systems and AppData\Local\kilnbfiles on Windows; and
- The default or default-push remote paths in .hg/hgrc.

The system cache may be on network storage, so that an entire network of
developers may share their files over NFS or SMB.
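The lookup order above could be sketched roughly as follows. This is only an
illustration under assumptions: the function name find_bfile, the idea that
cached files are stored under their SHA1 sum as the file name, and the
fetch_remote callback are all hypothetical, not taken from the extension.

```python
import os


def find_bfile(sha1, repo_root, system_cache, fetch_remote):
    """Return a path to the bfile with the given SHA1 sum.

    Checks the repository-local cache, then the system cache, and only
    then falls back to the remote repository via fetch_remote(sha1).
    """
    candidates = [
        os.path.join(repo_root, ".hg", "kilnbfiles", sha1),  # repo-local cache
        os.path.join(system_cache, sha1),                    # system cache
    ]
    for path in candidates:
        if os.path.isfile(path):
            return path
    # Neither cache has it: ask the default / default-push remote.
    return fetch_remote(sha1)
```

Checking the local locations first means the network is only touched for
versions that no cache on the machine (or shared drive) already holds.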

When a file is committed as a bfile, it is copied to the repository-local cache
and to the system cache, and its standin is written in .kbf/. When pushing
changes to bfiles to a remote repository, any changed bfiles are uploaded with
the changesets. When pulling, though, only the changesets are transferred,
greatly reducing clone sizes for repositories containing heavily-edited binary
files. Then, when updating to a revision with changes to bfiles, the required
versions of the files are retrieved from either the system cache or the remote
repository.
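The commit-time step described above, copying a bfile into both caches, might
look something like this. It is a sketch under the assumption that caches are
flat directories keyed by SHA1 sum; cache_bfile is a hypothetical name.

```python
import hashlib
import os
import shutil


def cache_bfile(src_path, repo_cache, system_cache):
    """Copy src_path into both caches under its SHA1 sum; return the sum."""
    h = hashlib.sha1()
    with open(src_path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    digest = h.hexdigest()
    for cache_dir in (repo_cache, system_cache):
        os.makedirs(cache_dir, exist_ok=True)
        dest = os.path.join(cache_dir, digest)
        # Content-addressed: an existing file with this name is the same data.
        if not os.path.exists(dest):
            shutil.copyfile(src_path, dest)
    return digest
```

Because files are addressed by content, committing the same binary twice
costs no extra cache space.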

kbfiles has several mechanisms for defending its repositories against damage
from non-kbfiles clients:
- it adds a 'kbfiles' line to .hg/requires in order to keep non-kbfiles clients
from breaking things;
- it adds a 'bfilestore' server capability, without which the client will not
attempt to interact with a remote repository when the local repository uses
kbfiles; and
- it prepends 'kbfiles\n' to the output of the heads command when serving
kbfiles repositories to prevent non-kbfiles clients from creating broken clones.
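The first mechanism relies on the way a client refuses to open a repository
whose .hg/requires lists a feature it does not know. A simplified sketch of
that kind of check (not Mercurial's actual code; the SUPPORTED set and the
function name are assumptions for illustration):

```python
# Features this hypothetical client understands.
SUPPORTED = {"revlogv1", "store", "fncache", "dotencode"}


def check_requirements(requires_text, supported=SUPPORTED):
    """Raise if the requires file names a feature the client lacks."""
    required = {ln.strip() for ln in requires_text.splitlines() if ln.strip()}
    unknown = required - set(supported)
    if unknown:
        raise RuntimeError(
            "repository requires unknown features: " + ", ".join(sorted(unknown))
        )
    return required
```

A client without the extension would fail on the 'kbfiles' entry, while one
with the extension loaded would have added 'kbfiles' to its supported set.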

The last of these is fairly likely to be controversial, but it currently seems
to be necessary. Although the HG19 bundle format as described on the wiki
would appear to solve the problem with its feature strings, it also does not
appear to be implemented yet. If and when it is, kbfiles will replace the
heads command hack with a 'kbfiles' bundle feature. Unfortunately, the result
is that non-kbfiles clients throw an exception with no mention of kbfiles, but
we could not find a way to make the client display a useful error message while
consistently preventing them from uploading changesets without the
corresponding bfiles or creating clones that are missing files.

As it stands, as long as either the client or the server has the current
version of kbfiles or either repo has been touched by the current version of
kbfiles, there are no known cases that cause missing bfiles.

The extension wraps most operations on repositories to handle bfiles specially;
this can be seen in bfsetup.py. It also explicitly handles cooperation with
several other extensions, including fetch, purge, and rebase.

Bfile transfer is implemented via three additions to the wire protocol on
servers with the extension loaded:
- statbfile, which returns 0, 1, or 2 depending on whether the requested bfile
(as identified by the SHA1 sum) is present and valid, invalid, or missing;
- getbfile, which returns the requested bfile along with its length to allow
the ssh protocol to avoid reading beyond its end (without modifying Mercurial
core code that attempts to encode passed-in file-like objects as bundles); and
- putbfile, which hashes and verifies the received data and places it in the
repository-local and system caches.
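The semantics of the three commands can be sketched on top of a plain
directory store. This is an illustration only: it ignores the wire encoding,
uses a single store directory instead of the two caches, reads whole files
into memory, and the function signatures are assumptions rather than the
extension's real ones.

```python
import hashlib
import os

# Status codes in the order given above.
VALID, INVALID, MISSING = 0, 1, 2


def _sha1(data):
    return hashlib.sha1(data).hexdigest()


def statbfile(store_dir, sha1):
    """Return 0 if the bfile is present and valid, 1 if invalid, 2 if missing."""
    path = os.path.join(store_dir, sha1)
    if not os.path.isfile(path):
        return MISSING
    with open(path, "rb") as f:
        return VALID if _sha1(f.read()) == sha1 else INVALID


def getbfile(store_dir, sha1):
    """Return (length, data) so a reader knows exactly where the file ends."""
    with open(os.path.join(store_dir, sha1), "rb") as f:
        data = f.read()
    return len(data), data


def putbfile(store_dir, sha1, data):
    """Verify the received data against sha1 before storing it."""
    if _sha1(data) != sha1:
        return False
    os.makedirs(store_dir, exist_ok=True)
    with open(os.path.join(store_dir, sha1), "wb") as f:
        f.write(data)
    return True
```

Verifying on put and on stat means a corrupted upload is rejected up front
and a corrupted stored file is reported as invalid rather than served.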

The extension also currently supports talking to previous versions of Kiln that
still serve bfiles over a different interface, via POST and GET requests to
$REPO/bfile/$SHA. Although we would prefer to keep this in the extension, we
are able and willing to pull it out into its own meta-extension if necessary.

We are still in the process of cleaning up the code to ship with Mercurial, but
the current status can be seen at
http://developers.kilnhg.com/Repo/Kiln/Group/Unstable/Files. Before the 'real'
pull request, we will collapse it into a single patch in the hgext directory.
Planned changes before then include removing compatibility shims for old
versions of Mercurial and some minor rebranding to remove mentions of 'Kiln'
from the code and repository layout.

We would prefer to avoid renaming the extension if possible, both to avoid
adding extra code to handle both old repositories and new ones and to reflect
the heritage of the extension, but we understand that parts of the Mercurial
community may be opposed to the name 'kbfiles', and as such we are willing to
rename to 'terafiles' if the name would otherwise block the extension from
shipping with Mercurial.
Adrian Buehlmann
2011-07-26 19:10:26 UTC
Post by Andrew Pritchard
We are still in the process of cleaning up the code to ship with Mercurial, but
the current status can be seen at
http://developers.kilnhg.com/Repo/Kiln/Group/Unstable/Files. Before the 'real'
pull request, we will collapse it into a single patch in the hgext directory.
Planned changes before then include removing compatibility shims for old
versions of Mercurial and some minor rebranding to remove mentions of 'Kiln'
from the code and repository layout.
We would prefer to avoid renaming the extension if possible, both to avoid
adding extra code to handle both old repositories and new ones and to reflect
the heritage of the extension, but we understand that parts of the Mercurial
community may be opposed to the name 'kbfiles', and as such we are willing to
rename to 'terafiles' if the name would otherwise block the extension from
shipping with Mercurial.
We got a request to support kbfiles in the TortoiseHg shell extension
(for Windows). I was a bit worried about seeing file paths like
".hg/kilnbfiles/dirstate" (see function openbfdirstate in
kbfiles/bfutil.py).

People may call me paranoid, but the reason why I'm a bit worried is
that the name "Kiln" is a registered trademark by FogCreek [1]. I do not
expect that they would ever forbid anyone using this name, but in theory
they could.

Frankly, I'd prefer not giving any company whatsoever the ability to
"pull the plug" on anything related to Mercurial.

[1] Registration number 3869331, United States Patent and Trademark
Office (see uspto.gov)
Na'Tosha Bard
2011-07-26 19:42:28 UTC
Post by Adrian Buehlmann
Post by Andrew Pritchard
We are still in the process of cleaning up the code to ship with Mercurial, but
the current status can be seen at
http://developers.kilnhg.com/Repo/Kiln/Group/Unstable/Files. Before the 'real'
pull request, we will collapse it into a single patch in the hgext directory.
Planned changes before then include removing compatibility shims for old
versions of Mercurial and some minor rebranding to remove mentions of 'Kiln'
from the code and repository layout.
We would prefer to avoid renaming the extension if possible, both to avoid
adding extra code to handle both old repositories and new ones and to reflect
the heritage of the extension, but we understand that parts of the Mercurial
community may be opposed to the name 'kbfiles', and as such we are willing to
rename to 'terafiles' if the name would otherwise block the extension from
shipping with Mercurial.
We got a request to support kbfiles in the TortoiseHg shell extension
(for Windows). I was a bit worried about seeing file paths like
".hg/kilnbfiles/dirstate" (see function openbfdirstate in
kbfiles/bfutil.py).
People may call me paranoid, but the reason why I'm a bit worried is
that the name "Kiln" is a registered trademark by FogCreek [1]. I do not
expect that they would ever forbid anyone using this name, but in theory
they could.
Frankly, I'd prefer not giving any company whatsoever the ability to
"pull the plug" on anything related to Mercurial.
[1] Registration number 3869331, United States Patent and Trademark
Office (see uspto.gov)
My team uses kbfiles heavily and would like to see the extension shipped
with mercurial as well.

Regarding the name, I recall a discussion at the 1.9 sprint that suggested
"HugeFiles" would be a good alternative name (I recall discussing this with
Benjamin offlist, but I don't know what conclusion you guys came to on your
end about it). I started some work to do the renaming on my end to submit
upstream as a patch, but discovered Kiln has some problems when it doesn't
see a "kbfiles" extension enabled client-side, so I never got around to
finishing it. I can understand the concern Adrian has regarding FogCreek
having a trademark on Kiln, and also I recall some people thinking "kbfiles"
was a bit weird because it makes the user think of "kilobyte files" or
"files in the size of kilobytes", which clearly do not need such an
extension for usage in Mercurial.

Personally, I think "terafiles" also sounds quite strange, especially since
it seems likely that people familiar with the family of bfile-related
extensions will inadvertently say "tbfiles" :-)

Cheers,
Na'Tosha
Adrian Buehlmann
2011-07-28 08:50:06 UTC
Post by Na'Tosha Bard
My team uses kbfiles heavily and would like to see the extension shipped
with mercurial as well.
Interesting.
Post by Na'Tosha Bard
Regarding the name, I recall a discussion at the 1.9 sprint that
suggested "HugeFiles" would be a good alternative name (I recall
discussing this with Benjamin offlist, but I don't know what conclusion
you guys came to on your end about it). I started some work to do the
renaming on my end to submit upstream as a patch, but discovered Kiln
has some problems when it doesn't see a "kbfiles" extension enabled
client-side, so I never got around to finishing it. I can understand
the concern Adrian has regarding FogCreek having a trademark on Kiln,
and also I recall some people thinking "kbfiles" was a bit weird because
it makes the user think of "kilobyte files" or "files in the size of
kilobytes", which clearly do not need such an extension for usage in
Mercurial.
Personally, I think "terafiles" also sounds quite strange, especially
since it seems likely that people familiar with the family of
bfile-related extensions will inadvertently say "tbfiles" :-)
Honestly, I don't really care that much about the name of the extension
itself, as long as we don't get into troubles by using registered
trademarks. So, if we can avoid the word "Kiln", I'm perfectly fine.

Using the name "kbfiles" seems safe to me (regarding trademarks) and I
have no qualms about it having significant contributions by "the Kiln
folks", or its provenance in general. IIUC, the initial work was done by
Greg anyway.
Angel Ezquerra
2011-07-28 09:25:04 UTC
Post by Adrian Buehlmann
Honestly, I don't really care that much about the name of the extension
itself, as long as we don't get into troubles by using registered
trademarks. So, if we can avoid the word "Kiln", I'm perfectly fine.
Using the name "kbfiles" seems safe to me (regarding trademarks) and I
have no qualms about it having significant contributions by "the Kiln
folks", or its provenance in general. IIUC, the initial work was done by
Greg anyway.
I personally like Matt's name proposal. "largefiles" is very
descriptive of what the extension does. As others said, "kbfiles"
could be confusing for users, since "kb" refers to kilobyte or
kilobit...
Obviously the FogCreek guys should have a final say, since it's their
extension after all...

Angel
Adrian Buehlmann
2011-07-28 09:37:23 UTC
Post by Angel Ezquerra
Post by Adrian Buehlmann
Honestly, I don't really care that much about the name of the extension
itself, as long as we don't get into troubles by using registered
trademarks. So, if we can avoid the word "Kiln", I'm perfectly fine.
Using the name "kbfiles" seems safe to me (regarding trademarks) and I
have no qualms about it having significant contributions by "the Kiln
folks", or its provenance in general. IIUC, the initial work was done by
Greg anyway.
I personally like Matt's name proposal. "largefiles" is very
descriptive of what the extension does. As others said, "kbfiles"
could be confusing for users, since "kb" refers to kilobyte or
kilobit...
Obviously the FogCreek guys should have a final say, since it's their
extension after all...
I'm fine with the name "largefiles" as well. Just pick one that works.

I just wanted to say that I have no problems with keeping kbfiles
either. Andrew said they would like to keep kbfiles if possible.
Andrew Pritchard
2011-08-01 17:05:50 UTC
Just a quick update on what's going on with what was once kbfiles:

I've changed it to use the SSH error stream and the HTTP content-type trick
along with moving the heads command to give non-kbfiles clients good error
messages.

I'm in the process of rebranding it to largefiles, as this seems to be the most
widely-accepted name. Although this entails adding conversion logic to the
extension to handle the "old" repositories, that logic won't be present in the
final version submitted to the core repo. Our extensions update themselves
automatically, so we can ship conversion logic for the intervening period and
be confident that any active repositories are converted by the time the next
version of Mercurial ships.

The Kiln-specific parts of the extension will be separated out into their own
meta-extension that we will ship with the other Kiln-specific extensions so as
to avoid polluting the core repository with legacy Kiln support. Fortunately,
this is no more difficult than adding commands to Mercurial.

As always, the current state can be seen at
https://developers.kilnhg.com/Repo/Kiln/Group/Unstable
Andrew Pritchard
2011-08-04 14:12:45 UTC
The largefiles rename is done, and the Kiln-specific bits are now split out
into their own meta-extension. We have moved it out of the old repository into
a more appropriately-named one at
https://developers.kilnhg.com/Repo/Kiln/largefiles/largefiles. At this point,
there shouldn't be any remaining naming or copyright concerns, so the focus can
be placed on anything that would block largefiles' inclusion in Mercurial's
bundled extensions.
Na'Tosha Bard
2011-08-05 08:05:10 UTC
Post by Andrew Pritchard
The largefiles rename is done, and the Kiln-specific bits are now split out
into their own meta-extension. We have moved it out of the old repository into
a more appropriately-named one at
https://developers.kilnhg.com/Repo/Kiln/largefiles/largefiles. At this point,
there shouldn't be any remaining naming or copyright concerns, so the focus can
be placed on anything that would block largefiles' inclusion in Mercurial's
bundled extensions.
OK, so now that the copyright stuff is out of the way, I think we have at
least the following issue left: documentation for users who are not using
Kiln. The one time we tried setting up the then-called kbfiles here at
Unity without Kiln, we were successful, but only after writing our own FTP
store. It seems like you've fixed this in largefiles, with a default store
option for people who don't have Kiln, but we need a nice guide for users to
actually set it up.

Also, I've been submitting fixes for bugs upstream by pushing to Unity's
branch for the old kbfiles repository. Do we need a new branch for the
largefiles repository? Additionally, can I have people on my team start
testing largefiles with Kiln, or will it break?

I wonder if Matt also has a list of stuff he wants addressed?

Cheers,
Na'Tosha
--
*Na'Tosha Bard*
Build & Infrastructure Developer | Unity Technologies

*E-Mail:* ***@unity3d.com
*Skype:* natosha.bard
Adrian Buehlmann
2011-08-05 09:09:03 UTC
Post by Andrew Pritchard
The largefiles rename is done, and the Kiln-specific bits are now split out
into their own meta-extension. We have moved it out of the old repository into
a more appropriately-named one at
https://developers.kilnhg.com/Repo/Kiln/largefiles/largefiles. At this point,
there shouldn't be any remaining naming or copyright concerns, so the focus can
be placed on anything that would block largefiles' inclusion in Mercurial's
bundled extensions.
I just note that Greg also has

# Copyright 2009-2011 Intelerad Medical Systems Incorporated.

in his current copyright headers (most likely the company that paid him
for working on bfiles, as he says. I bet he has a copyright assignment
agreement with them for the original bfiles). Perhaps this should be
added as well?
Greg Ward
2011-08-04 14:29:15 UTC
Post by Angel Ezquerra
Post by Adrian Buehlmann
Honestly, I don't really care that much about the name of the extension
itself, as long as we don't get into troubles by using registered
trademarks. So, if we can avoid the word "Kiln", I'm perfectly fine.
Using the name "kbfiles" seems safe to me (regarding trademarks) and I
have no qualms about it having significant contributions by "the Kiln
folks", or its provenance in general. IIUC, the initial work was done by
Greg anyway.
I personally like Matt's name proposal. "largefiles" is very
descriptive of what the extension does. As others said, "kbfiles"
could be confusing for users, since "kb" refers to kilobyte or
kilobit...
Obviously the FogCreek guys should have a final say, since it's their
extension after all...
I couldn't let *this* pass by un-responded-to.

IT'S NOT THEIR EXTENSION.

I designed it, with help from my colleague Peter Neelin and (IIRC)
feedback from the mercurial-devel list. I wrote it. I tested it. I
reviewed patches from several contributors (thanks everyone!). I
mentored a couple of summer students that Fog Creek recruited in
summer 2010 to improve bfiles, and I massaged and merged in their
patches. It's my bloody extension, and Fog Creek forked it. You
wouldn't know that from reading the "hg log" of their repository, but
it's the truth.

And it should be noted that much of the work was done for my employer
on work time. So I have finally added appropriate copyright and
license statements to the bfiles source code:

http://hg.gerg.ca/hg-bfiles/rev/6f832a089582

Incidentally, yes I did get approval from my employer to release
bfiles publicly under the GPL, and I got that approval quite a while
ago -- spring 2010 I think?

My understanding of copyright law (at least in countries governed by
the Berne convention, including Canada [where I live and wrote bfiles]
and the US [where Fog Creek is based]) is that the source code has
been copyrighted as described by that patch from the second I wrote
it. Adding the formal statements to the code does not change the
copyright status of the work, it merely clarifies it.

Thus, I expect Fog Creek to add similar copyright statements to the
largefiles code. I also expect them to add a third line stating that
Fog Creek shares the copyright, because *of course they do*. They have
written plenty of code in that extension, and it sounds like they have
done good stuff with it. But it's not all their code, not by a long
shot.

Greg
Angel Ezquerra Moreu
2011-08-05 09:02:43 UTC
Post by Greg Ward
Post by Angel Ezquerra
Post by Adrian Buehlmann
Honestly, I don't really care that much about the name of the extension
itself, as long as we don't get into troubles by using registered
trademarks. So, if we can avoid the word "Kiln", I'm perfectly fine.
Using the name "kbfiles" seems safe to me (regarding trademarks) and I
have no qualms about it having significant contributions by "the Kiln
folks", or its provenance in general. IIUC, the initial work was done by
Greg anyway.
I personally like Matt's name proposal. "largefiles" is very
descriptive of what the extension does. As others said, "kbfiles"
could be confusing for users, since "kb" refers to kilobyte or
kilobit...
Obviously the FogCreek guys should have a final say, since it's their
extension after all...
I couldn't let *this* pass by un-responded-to.
IT'S NOT THEIR EXTENSION.
I designed it, with help from my colleague Peter Neelin and (IIRC)
feedback from the mercurial-devel list. I wrote it. I tested it. I
reviewed patches from several contributors (thanks everyone!). I
mentored a couple of summer students that Fog Creek recruited in
summer 2010 to improve bfiles, and I massaged and merged in their
patches. It's my bloody extension, and Fog Creek forked it. You
wouldn't know that from reading the "hg log" of their repository, but
it's the truth.
Greg,

I certainly did not want to upset you or diminish the credit that you
should get for your work.

What I should have said is that only Matt and those that contributed
to the extension should be the ones with a last say on the extension
name. I did not have a clear idea of the "genealogy" of this
extension. I knew that there were several competing "big files"
extensions, but I did not know the relationship between them. Hence my
mistake.

Anyway, it seems that the FogCreek guys are trying to address your
(and other people's) concerns. I hope they manage to do so to your
entire satisfaction.

I am actually quite excited by this development since this is one of
those things that people point out as a weakness of mercurial (and
other DVCS), even though in my experience this is a non issue in most
regular usage scenarios.

Cheers,

Angel
Na'Tosha Bard
2011-08-05 12:41:03 UTC
Post by Angel Ezquerra Moreu
I am actually quite excited by this development since this is one of
those things that people point out as a weakness of mercurial (and
other DVCS), even though in my experience this is a non issue in most
regular usage scenarios.
I am really thrilled about how well this project is moving along, as well.
Without a tool like this, Unity would not be using any DVCS (at least not
without having to have developed some similar system first). As it is, this
makes it possible for a company in the game development industry (an
industry where binaries on the order of several gigabytes are the norm -- at
least we don't have any that big) to use Mercurial. Without bfiles, our
clone sizes would be increasing on the order of at *least* several hundred
megabytes or by some gigabytes every week.

This tool really opens the doors for a whole new group of users.

Cheers,
N.
--
*Na'Tosha Bard*
Build & Infrastructure Developer | Unity Technologies

*E-Mail:* ***@unity3d.com
*Skype:* natosha.bard
Martin Geisler
2011-08-05 17:00:13 UTC
"Na'Tosha Bard" <***@unity3d.com> writes:

Hi Na'Tosha,

I'm back from vacation and slowly catching up on all the emails you guys
have sent the last two weeks :-)
Post by Na'Tosha Bard
I am really thrilled about how well this project is moving along, as
well. Without a tool like this, Unity would not be using any DVCS (at
least not without having to have developed some similar system first).
As it is, this makes it possible for a company in the game development
industry (an industry where binaries on the order of several gigabytes
are the norm -- at least we don't have any that big) to use Mercurial.
Without bfiles, our clone sizes would be increasing on the order of at
*least* several hundred megabytes or by some gigabytes every week.
This tool really opens the doors for a whole new group of users.
I have a bit of feedback from my client -- the one who made the snap
extension. That extension is now being decommissioned and the department
that used it is looking for an alternative solution.

One problem they mentioned is the per-file overhead: Instead of having
200 files of 1 GB, they have 20,000 files, each of which is just 10 MB.
That is kind of upside-down compared to what I expected. How many big
files do you have in your repository?

From skimming the code, it seems that kbfiles opens a connection per
file that it sends to a remote store. If that is right, it sounds costly
when the number of files grows very large.
--
Martin Geisler

Mercurial links: http://mercurial.ch/
Augie Fackler
2011-08-05 18:33:57 UTC
Post by Martin Geisler
Hi Na'Tosha,
I'm back from vacation and slowly catching up on all the emails you guys
have sent the last two weeks :-)
Post by Na'Tosha Bard
I am really thrilled about how well this project is moving along, as
well. Without a tool like this, Unity would not be using any DVCS (at
least not without having to have developed some similar system first).
As it is, this makes it possible for a company in the game development
industry (an industry where binaries on the order of several gigabytes
are the norm -- at least we don't have any that big) to use Mercurial.
Without bfiles, our clone sizes would be increasing on the order of at
*least* several hundred megabytes or by some gigabytes every week.
This tool really opens the doors for a whole new group of users.
I have a bit of feedback from my client -- the one who made the snap
extension. That extension is now being decommissioned and the department
that used it is looking for an alternative solution.
One problem they mentioned is the per-file overhead: Instead of having
200 files of 1 GB, they have 20,000 files, each of which is just 10 MB.
That is kind of upside-down compared to what I expected. How many big
files do you have in your repository?
From skimming the code, it seems that kbfiles opens a connection per
file that it sends to a remote store. If that is right, it sounds costly
when the number of files grows very large.
It shouldn't be hard (for http, at least) to have some kind of
connection pool. I've written one already for our urlopen support
using mercurial.httpclient...
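To make the idea concrete, here is a minimal sketch of reusing a single HTTP
connection for many files instead of opening one per file. It uses Python's
standard http.client rather than mercurial.httpclient, and the function name
fetch_all and the request paths are assumptions for illustration; a real pool
would also handle several hosts and concurrent use.

```python
import http.client


def fetch_all(host, port, paths):
    """Fetch every path over one keep-alive connection.

    Returns a dict mapping each path to (status, body).
    """
    conn = http.client.HTTPConnection(host, port)
    results = {}
    try:
        for path in paths:
            conn.request("GET", path)
            resp = conn.getresponse()
            # The body must be fully read before the connection is reused.
            results[path] = (resp.status, resp.read())
    finally:
        conn.close()
    return results
```

With 20,000 files of 10 MB each, connection setup is paid once here rather
than 20,000 times, provided the server keeps the connection open.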
Na'Tosha Bard
2011-08-05 20:24:29 UTC
Hi Martin,
Post by Martin Geisler
Hi Na'Tosha,
I'm back from vacation and slowly catching up on all the emails you guys
have sent the last two weeks :-)
Post by Na'Tosha Bard
I am really thrilled about how well this project is moving along, as
well. Without a tool like this, Unity would not be using any DVCS (at
least not without having to have developed some similar system first).
As it is, this makes it possible for a company in the game development
industry (an industry where binaries on the order of several gigabytes
are the norm -- at least we don't have any that big) to use Mercurial.
Without bfiles, our clone sizes would be increasing on the order of at
*least* several hundred megabytes or by some gigabytes every week.
This tool really opens the doors for a whole new group of users.
I have a bit of feedback from my client -- the one who made the snap
extension. That extension is now being decommissioned and the department
that used it is looking for an alternative solution.
One problem they mentioned is the per-file overhead: Instead of having
200 files of 1 GB, they have 20,000 files, each of which is just 10 MB.
That is kind of upside-down compared to what I expected. How many big
files do you have in your repository?
We apparently have 31 files in our main repository, but I have some other,
less used ones, that have several hundred. None have proven to be a
problem, but you do raise a good point. I'd like to see Benjamin or
Andrew's feedback.

Cheers,
Na'Tosha
--
*Na'Tosha Bard*
Build & Infrastructure Developer | Unity Technologies

*E-Mail:* ***@unity3d.com
*Skype:* natosha.bard
Martin Geisler
2011-08-06 10:43:35 UTC
Post by Na'Tosha Bard
Hi Martin,
Post by Martin Geisler
One problem they mentioned is the per-file overhead: Instead of
having 200 files of 1 GB, they have 20,000 files, each of which is
just 10 MB. That is kind of upside-down compared to what I expected.
How many big files do you have in your repository?
We apparently have 31 files in our main repository, but I have some other,
less used ones, that have several hundred.
Okay, that was also the kind of numbers I had imagined: ~50 files of
maybe 100+ MB.
Post by Na'Tosha Bard
None have proven to be a problem, but you do raise a good point. I'd
like to see Benjamin or Andrew's feedback.
Me too :-)
--
Martin Geisler

Mercurial links: http://mercurial.ch/
Andrew Pritchard
2011-08-06 18:17:05 UTC
As for documentation, we (or at least I) have been putting it off until
largefiles is closer to release - at the moment there are still a few
outstanding bugs and plenty of internal testing to do. Nonetheless, it is
pretty simple to use a non-Kiln store: simply serve via hgweb or ssh with the
largefiles extension enabled, and everything should work appropriately. There
are still some concerns about the more distributed way it can work now, because
it will always look for largefiles on the default path, and it might be
appropriate to add a config option for a default store separate from the
default-push paths.

At the moment, largefiles' branching is somewhat confusing, since we have one
repository containing what should be incorporated into Mercurial and a separate
repository for what we will ship with the Kiln Extensions in order to aid
migrating repositories to the newer layout. As such, fixes towards largefiles
in general are going into the 'largefiles' repo, and work on migration code
and Kiln-specific things are going into the 'largefiles-kiln' repo.
Unfortunately, this looks likely to break down as soon as we start stripping
compatibility for old versions of Mercurial from the 'largefiles' repo, as we
don't want to merge anti-backwards-compat changes into the Kiln version, but we
will still want to pull bugfixes and feature additions. As the two diverge, we
will probably add another repository for changes we want in both, and we can
add a branch repository there for Unity's contributions.

As for testing with Kiln, we have split out the Kiln communication code into a
'kilnstore' extension, whose repository is in the same place as the largefiles
ones. It looks for largefiles and monkey-patches in the code for talking to
Kiln's kbfiles routes. With both largefiles and kilnstore enabled, there
_shouldn't_ be any problems, but not very many people have been using the
latest version (since the rename) - possibly only me, in fact - so it's fairly
likely to have problems. Three things have changed in the repository storage
along with the name: the 'kbfiles' requirement is now 'largefiles', the
'.hg/kilnbfiles' directory is now '.hg/largefiles', and the '.kbf' directory is
now '.hglf'. The last is mostly because the '.hg*' prefix is traditionally
considered reserved for Mercurial's use and is substantially less likely to
collide with anyone's normal files. The first two are handled transparently by
largefiles-kiln's migration code, which just renames the directory and changes
the requirement if the old one is present. The other is used transparently
by largefiles-kiln, in that repositories with '.kbf' standins still work, but
they cannot be transparently migrated because the changeset nodeids would
change. As of right now, the only ways to migrate are using lfconvert to
convert via a normal repository or using the convert extension with a filemap
to rename the .kbf directories. An actual migration command is coming.
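
The two renames that are described above as transparent (the requirement and the store directory) can be sketched roughly as follows. This is only an illustration and not the largefiles-kiln migration code: the function name `migrate_layout` is made up, and it assumes .hg/requires is a plain text file with one requirement per line.

```python
import os

OLD_REQUIREMENT = "kbfiles"
NEW_REQUIREMENT = "largefiles"
OLD_STORE = "kilnbfiles"
NEW_STORE = "largefiles"


def migrate_layout(hg_dir):
    """Rename the old requirement and store directory inside hg_dir.

    Hypothetical helper; returns True if anything was changed.
    """
    changed = False

    # Swap the 'kbfiles' requirement for 'largefiles', keeping all others.
    requires_path = os.path.join(hg_dir, "requires")
    if os.path.exists(requires_path):
        with open(requires_path) as fp:
            requirements = fp.read().splitlines()
        if OLD_REQUIREMENT in requirements:
            requirements = [NEW_REQUIREMENT if r == OLD_REQUIREMENT else r
                            for r in requirements]
            with open(requires_path, "w") as fp:
                fp.write("\n".join(requirements) + "\n")
            changed = True

    # Rename .hg/kilnbfiles to .hg/largefiles if only the old one exists.
    old_store = os.path.join(hg_dir, OLD_STORE)
    new_store = os.path.join(hg_dir, NEW_STORE)
    if os.path.isdir(old_store) and not os.path.exists(new_store):
        os.rename(old_store, new_store)
        changed = True

    return changed
```

The '.kbf' to '.hglf' rename is deliberately left out of the sketch, since the standins are tracked files and renaming them changes changeset nodeids.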

Largefiles does currently open large numbers of connections to download needed
files, which I have recently discovered to be particularly annoying when IIS
decides it needs a full second to decide which HTTP handler to call for a
request. This could be alleviated with not too much difficulty by two changes:
first, make 'statlfile' batchable; and second, turn 'getlfile' into
'getlfiles', sending multiple files in one connection, either in an ad-hoc
line-based protocol like Mercurial's ssh transport, or in a tar archive.
The same could be done for 'putlfile'.
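
As a rough sketch of the tar variant of 'getlfiles', assuming each archive member is simply named after the SHA-1 of its contents (that naming, and the function names `pack_largefiles` and `unpack_largefiles`, are assumptions for illustration and not an agreed wire format):

```python
import hashlib
import io
import tarfile


def pack_largefiles(blobs):
    """Pack {sha1_hex: bytes} into one in-memory tar archive (server side)."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as archive:
        for sha, data in sorted(blobs.items()):
            info = tarfile.TarInfo(name=sha)
            info.size = len(data)
            archive.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def unpack_largefiles(payload):
    """Unpack the archive and verify every member against its name (client side)."""
    result = {}
    with tarfile.open(fileobj=io.BytesIO(payload), mode="r") as archive:
        for member in archive.getmembers():
            data = archive.extractfile(member).read()
            if hashlib.sha1(data).hexdigest() != member.name:
                raise ValueError("hash mismatch for %s" % member.name)
            result[member.name] = data
    return result
```

A real implementation would stream to and from disk rather than hold every file in memory, but the shape of the exchange (one request, many files, each verified on arrival) stays the same.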
Laurens Holst
2011-08-08 10:09:28 UTC
Post by Andrew Pritchard
Largefiles does currently open large numbers of connections to download needed
files, which I have recently discovered to be particularly annoying when IIS
decides it needs a full second to decide which HTTP handler to call for a
first, make 'statlfile' batchable; and second, turn 'getlfile' into
'getlfiles', sending multiple files in one connection, either in an ad-hoc
line-based protocol like Mercurial's ssh transport, or in a tar archive.
The same could be done for 'putlfile'.
Why not just open a couple of HTTP connections simultaneously? That
would alleviate the problem and keep the interface simple.
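
A minimal sketch of that idea, assuming the client already has some function that downloads one largefile by hash (called `fetch_one` here; both it and `fetch_all` are hypothetical names, not part of largefiles):

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_all(hashes, fetch_one, max_connections=4):
    """Download every hash using at most max_connections concurrent calls.

    fetch_one is a caller-supplied function taking one hash and returning
    the file contents; it stands in for the existing single-file request.
    """
    hashes = list(hashes)
    with ThreadPoolExecutor(max_workers=max_connections) as pool:
        # pool.map yields results in the same order as its input.
        results = pool.map(fetch_one, hashes)
        return dict(zip(hashes, results))
```

This keeps the per-file command untouched; the cost is that the server still sees one request per file, so any fixed per-request delay is overlapped rather than removed.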

~Laurens
Matt Mackall
2011-07-26 20:23:06 UTC
Post by Andrew Pritchard
kbfiles has several mechanisms for defending its repositories against damage
- add a 'kbfiles' line to .hg/requires in order to keep non-kbfiles clients
from breaking things;
- add a 'bfilestore' server capability, without which the client will not
attempt to interact with a remote repository when the local repository uses
kbfiles; and
- prepend 'kbfiles\n' to the output of the heads command when serving kbfiles
repositories to prevent non-kbfiles clients from creating broken clones.
The last of these is fairly likely to be controversial, but it currently seems
to be necessary. Although the HG19 bundle format as described on the wiki
would appear to solve the problem with its feature strings, it also does not
appear to be implemented yet. If and when it is, kbfiles will replace the
heads command hack with a 'kbfiles' bundle feature. Unfortunately, the result
is that non-kbfiles clients throw an exception with no mention of kbfiles, but
we could not find a way to make the client display a useful error message while
consistently preventing them from uploading changesets without the
corresponding bfiles or creating clones that are missing files.
Ok, so the issues are:

a) we don't want clients to get incomplete/broken/bogus check-outs
b) we don't want clients to fail to push big files back to servers
c) we (probably?) don't want clients to convert big files back into very
large normal files and then push them again

On the other hand, we probably don't want to break the entire protocol.

So we want to cleanly refuse push and pull to clients who don't identify
themselves as big file users. I think we can probably manage this, and
still work with old clients:

$ hg in http://selenic.com/fail.bin
real URL is http://www.selenic.com/fail.bin
abort: 'http://www.selenic.com/fail.bin' does not appear to be an hg
repository:
---%<--- (application/octet-stream)
oops!

You're trying to pull from a server that requires you to have the
foo extension enabled.

---%<---
!

Support for this goes back as far as 1.4. Generating a similar ssh
banner should be even easier as we have an independent error stream.

Unfortunately, having the server decide whether or not to serve a client
based on _client_ capabilities is something we've carefully avoided up
to this point: all clients should be capable of reading from all
servers, and the client is supposed to make all the decisions based on
reported server capabilities. So the client never advertises its
capabilities to the server because the server doesn't care.

So the server needs to advertise "bigfiles" and then _move_ the existing
push/pull commands and replace them so that any client that uses the old
commands gets the error messages.
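
To illustrate the "move and replace" idea, here is a small sketch that treats the server's command table as a plain dict. Everything in it is an assumption for illustration: the `guard_commands` and `refuse` names, the 'bigfiles-' prefix for the moved commands, and the banner text. It is not Mercurial's actual wire protocol code.

```python
ERROR_BANNER = (
    "You're trying to talk to a server that requires you to have the\n"
    "bigfiles extension enabled.\n"
)


def refuse(*args):
    """Stand-in handler that old clients reach under the original name."""
    return ERROR_BANNER


def guard_commands(commands, capabilities, guarded, capability="bigfiles"):
    """Return (commands, capabilities) with the guarded commands moved.

    Each guarded command stays reachable under 'capability-name', while the
    original name now answers with the error banner. Inputs are not mutated.
    """
    new_commands = dict(commands)
    for name in guarded:
        new_commands[capability + "-" + name] = new_commands[name]
        new_commands[name] = refuse
    return new_commands, sorted(set(capabilities) | {capability})
```

A client that sees the new capability would call the prefixed names; a client that doesn't know about it keeps calling the old names and gets the banner instead of a broken clone.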
Post by Andrew Pritchard
The extension also currently supports talking to previous versions of Kiln that
still serve bfiles over a different interface, via POST and GET requests to
$REPO/bfile/$SHA. Although we would prefer to keep this in the extension, we
are able and willing to pull it out into its own meta-extension if necessary.
I guess this is for versions of Kiln that exist outside of your control?

Moving an extension into the main repo is pretty much the last point at
which we get to break backward compatibility and drop legacy support, so
I would ask you to seriously consider taking this opportunity to
jettison anything you don't want to support long-term.
Post by Andrew Pritchard
We are still in the process of cleaning up the code to ship with Mercurial, but
the current status can be seen at
http://developers.kilnhg.com/Repo/Kiln/Group/Unstable/Files. Before the 'real'
pull request, we will collapse it into a single patch in the hgext directory.
Planned changes before then include removing compatibility shims for old
versions of Mercurial and some minor rebranding to remove mentions of 'Kiln'
from the code and repository layout.
We would prefer to avoid renaming the extension if possible, both to avoid
adding extra code to handle both old repositories and new ones and to reflect
the heritage of the extension, but we understand that parts of the Mercurial
community may be opposed to the name 'kbfiles', and as such we are willing to
rename to 'terafiles' if the name would otherwise block the extension from
shipping with Mercurial.
I don't particularly object to Kiln part of the heritage being visible
and documented (though we also shouldn't lose track of Greg Ward's
contribution here!). I note from the repo that there's a shortage of
copyright headers, we'll want to get some on there.

But I think the name is liable to be a source of confusion:

- unlike the original 'bigfiles', its purpose isn't immediately obvious
- for a while at least, it won't be clear from bug reports which kbfiles
we're talking about and who's responsible for it
- as I've mentioned before, 'kb' actually implies -small- files!

I don't think 'terafiles' is ideal here either. How about simply
'largefiles'? It's not taken already and is clearly distinct from the
existing bigfiles/bfiles/kbfiles.
--
Mathematics is the supreme nostalgia of our time.
Greg Ward
2011-08-04 14:13:53 UTC
Finally catching up on this thread -- sorry for the delay.
Post by Andrew Pritchard
The goal of kbfiles is to maintain the benefit of version tracking for binary
files without requiring clones and pulls to download versions of large,
incompressible files that will likely never be needed.  These files are
replaced, according to the user's configuration, with small standin files
containing only the SHA1 sum of the binary file.  Mercurial then tracks these
standin files, keeping history small, while the binary files are retrieved
only as needed (when updating, for example).
Gee, this sounds familiar. Did I write that? No, that's actually a
good paraphrase of my words and ideas.
Post by Andrew Pritchard
The reasoning behind this is that binary files are frequently large and already
compressed as part of their format, and as such, compressed diffs don't work
very well to track their changes.
That also sounds familiar, but again it's a good paraphrasing.
Post by Andrew Pritchard
When a file is committed as a bfile, it is copied to the repository-local cache
and to the system cache, and its standin is written in .kbf/.  When pushing
changes to bfiles to a remote repository, any changed bfiles are uploaded with
the changesets.  When pulling, though, only the changesets are transferred,
greatly reducing clone sizes for repositories containing heavily-edited binary
files.  Then, when updating to a revision with changes to bfiles, the required
versions of the files are retrieved from either the system cache or the remote
repository.
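
The commit and lookup steps in the paragraph above can be sketched like this. It is a simplified illustration, not the kbfiles code: the helper names are made up, and it assumes cached files are stored under their SHA-1 and that a standin holds just the hash followed by a newline.

```python
import hashlib
import os
import shutil


def sha1_of(path):
    """Hash a file in chunks so large files are not read into memory at once."""
    digest = hashlib.sha1()
    with open(path, "rb") as fp:
        for chunk in iter(lambda: fp.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def commit_bfile(repo_root, relpath, system_cache):
    """Copy relpath into both caches and write its standin under .kbf/."""
    source = os.path.join(repo_root, relpath)
    digest = sha1_of(source)
    repo_cache = os.path.join(repo_root, ".hg", "kilnbfiles")
    for cache in (repo_cache, system_cache):
        os.makedirs(cache, exist_ok=True)
        shutil.copyfile(source, os.path.join(cache, digest))
    standin = os.path.join(repo_root, ".kbf", relpath)
    os.makedirs(os.path.dirname(standin), exist_ok=True)
    with open(standin, "w") as fp:
        fp.write(digest + "\n")
    return digest


def find_bfile(repo_root, digest, system_cache):
    """Return a cached copy of digest, or None if it must come from the remote."""
    for cache in (os.path.join(repo_root, ".hg", "kilnbfiles"), system_cache):
        candidate = os.path.join(cache, digest)
        if os.path.exists(candidate):
            return candidate
    return None
```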
kbfiles has several mechanisms for defending its repositories against damage
- add a 'kbfiles' line to .hg/requires in order to keep non-kbfiles clients
 from breaking things;
- add a 'bfilestore' server capability, without which the client will not
 attempt to interact with a remote repository when the local repository uses
 kbfiles; and
- prepend 'kbfiles\n' to the output of the heads command when serving kbfiles
 repositories to prevent non-kbfiles clients from creating broken clones.
Good stuff. These are things I have never addressed in bfiles, and
that need to be addressed. I'm glad you've taken care of them.
Post by Andrew Pritchard
Bfile transfer is implemented via three additions to the wire protocol on
- statbfile, which returns 0, 1, or 2 depending on whether the requested bfile
 (as identified by the SHA1 sum) is present and valid, invalid, or missing;
- getbfile, which returns the requested bfile along with its length to allow
 the ssh protocol to avoid reading beyond its end (without modifying Mercurial
 core code that attempts to encode passed-in file-like object as bundles); and
- putbfile, which hashes and verifies the received data and places it in the
 repository-local and system caches.
This also sounds better than bfiles -- I never touched Mercurial's
wire protocol.
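
For readers skimming the thread, the three commands can be modelled with a toy in-memory store. This is only a sketch of the behaviour described in the quoted text; the class name is made up, and reading "0, 1, or 2" as present-and-valid, invalid, and missing in that order is an assumption.

```python
import hashlib

PRESENT, INVALID, MISSING = 0, 1, 2


class BfileStore:
    """Toy in-memory stand-in for the server-side bfile store."""

    def __init__(self):
        self._blobs = {}

    def statbfile(self, sha):
        """Report whether sha is present and valid, invalid, or missing."""
        data = self._blobs.get(sha)
        if data is None:
            return MISSING
        if hashlib.sha1(data).hexdigest() != sha:
            return INVALID
        return PRESENT

    def getbfile(self, sha):
        """Return (length, data) so a reader knows where the file ends."""
        data = self._blobs[sha]
        return len(data), data

    def putbfile(self, sha, data):
        """Store data only if it hashes to sha; return True on success."""
        if hashlib.sha1(data).hexdigest() != sha:
            return False
        self._blobs[sha] = data
        return True
```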
Post by Andrew Pritchard
The extension also currently supports talking to previous versions of Kiln that
still serve bfiles over a different interface, via POST and GET requests to
$REPO/bfile/$SHA.  Although we would prefer to keep this in the extension, we
are able and willing to pull it out into its own meta-extension if necessary.
I think Matt is right: now is the time to jettison
backwards-compatibility legacy code, even if it makes life harder for
people like me (using bfiles, looking very carefully at
kbfiles/largefiles to see if switching is a win).
Post by Andrew Pritchard
We are still in the process of cleaning up the code to ship with Mercurial, but
the current status can be seen at
http://developers.kilnhg.com/Repo/Kiln/Group/Unstable/Files.  Before the 'real'
pull request, we will collapse it into a single patch in the hgext directory.
That's a *terrible* idea! You should preserve history!

Actually, your existing kbfiles repository already discards history.
Fog Creek has conveniently collapsed all of *my* work, plus an unknown
amount of forking and hacking, into a single large revision 0. That's
just wrong. It's morally wrong because it deprives the original author
(me) of public credit for his work. And it's technically wrong because
it makes it much harder to trace a given line of code back in history.

So that gets me to my first gripe with kbfiles/largefiles, which is
that you (Fog Creek) have almost completely erased the record of my
contribution. It's one thing to fork a project for your own needs, but
it's something else entirely to erase the origins of that code from
the historical record. I have no objection to the fork. I am a bit
unhappy that you have not tried very much to contribute changes back
upstream (i.e. to me). But I am most unhappy that you have nearly
erased me from the history. That's not cool.

Luckily, it's fixable: start with a clone of bfiles, possibly
truncated at your fork point, alongside your private internal
repository. Apply patches from your internal repo to the bfiles clone.
Then apply patches from the public kbfiles repo. End result: a
legitimate repository that captures the true history of the project,
without erasing anyone's contribution. Final step: rename things into
hgext/largefiles so the whole thing can be pulled into Mercurial.

Finally, I have two *technical* objections: the use of dirstate and
the use of standin files. I know, it's pretty rich for *me* to
criticise kbfiles/largefiles for using my design. But I'm in a pretty
good position to know where I got things wrong.

First, the use of a dedicated dirstate for big files was dumb and lazy
on my part. Big files have a different life-cycle from regular files,
and trying to shoehorn them into a separate dirstate instance just
doesn't work very well. I think the right thing to do is 1) draw a
diagram of the complete life-cycle of big files, 2) implement a custom
data structure (similar idea to dirstate) that tracks that life-cycle,
3) ditch the current hodge-podge of state-tracking mechanisms. I
haven't got very far on this, since bfiles is now a
weekends-and-evenings (and occasional quiet days at work) project for
me. Anyways, this is fixable by just dedicating some programmer time
to the problem.

Second, I'm not convinced that the fundamental design of bfiles -- the
use of standin files -- is appropriate. It complicates things a lot,
and the main reason I chose it was to allow partial bfupdate -- i.e.
don't make me fetch all of the big files in my working directory; I
just want to fetch some of them. I still think that's a nice feature,
but I wonder if it's worth the complication.

The approach taken by 'snap', where the big file hashes are stored
right in file revlogs, sounds interesting. I peeked at the code for
snap once, and was put off by the sheer volume of code. But it's an
interesting idea.

Alas, I'm not sure this is fixable. There are people out there in the
real world using bfiles and/or kbfiles/largefiles, and changing the
fundamental design would break all of their repositories and bfile
stores. ;-(

Oh yeah, for the record, I like the name 'largefiles'.

Greg
Adrian Buehlmann
2011-08-04 14:50:53 UTC
Post by Greg Ward
Finally catching up on this thread -- sorry for the delay.
Post by Andrew Pritchard
We are still in the process of cleaning up the code to ship with Mercurial, but
the current status can be seen at
http://developers.kilnhg.com/Repo/Kiln/Group/Unstable/Files. Before the 'real'
pull request, we will collapse it into a single patch in the hgext directory.
That's a *terrible* idea! You should preserve history!
Actually, your existing kbfiles repository already discards history.
Fog Creek has conveniently collapsed all of *my* work, plus an unknown
amount of forking and hacking, into a single large revision 0. That's
just wrong. It's morally wrong because it deprives the original author
(me) of public credit for his work. And it's technically wrong because
it makes it much harder to trace a given line of code back in history.
So that gets me to my first gripe with kbfiles/largefiles, which is
that you (Fog Creek) have almost completely erased the record of my
contribution. It's one thing to fork a project for your own needs, but
it's something else entirely to erase the origins of that code from
the historical record. I have no objection to the fork. I am a bit
unhappy that you have not tried very much to contribute changes back
upstream (i.e. to me). But I am most unhappy that you have nearly
erased me from the history. That's not cool.
What I would find really uncool is removing copyright headers in the
source files. But it looks like this is not what happened here (IIUC).

It looks to me like there were no copyright headers (in the individual
files), which - sorry Greg - looks to me like a bit of Greg's own fault.

I see Greg has added copyright headers ~1 hour ago now:
http://hg.gerg.ca/hg-bfiles/rev/6f832a089582

So I think Greg's current copyright headers should be merged into the
largefiles sources.

I don't think the full history must be preserved. It can't be preserved
anyway, if this extension will be included in Mercurial.
Benjamin Pollack
2011-08-04 14:44:22 UTC
Post by Greg Ward
Actually, your existing kbfiles repository already discards history.
Fog Creek has conveniently collapsed all of *my* work, plus an unknown
amount of forking and hacking, into a single large revision 0. That's
just wrong. It's morally wrong because it deprives the original author
(me) of public credit for his work. And it's technically wrong because
it makes it much harder to trace a given line of code back in history.
So that gets me to my first gripe with kbfiles/largefiles, which is
that you (Fog Creek) have almost completely erased the record of my
contribution. It's one thing to fork a project for your own needs, but
it's something else entirely to erase the origins of that code from
the historical record. I have no objection to the fork. I am a bit
unhappy that you have not tried very much to contribute changes back
upstream (i.e. to me). But I am most unhappy that you have nearly
erased me from the history. That's not cool.
I'll let Andrew respond to the rest of this email later, but I felt
that I had to respond to this part.

You're assuming a lot of malice where there is none.

Prior to the push that we've been doing over the last few months, our
fork of bfiles was extremely tightly integrated into Kiln. As such,
it was part of the general Kiln repository at Fog Creek, and we didn't
see a lot of point in submitting a bunch of Kiln-specific changes
upstream to you. There weren't honestly many core changes, past the
ones that we did submit on this mailing list as part of our
sponsorship of Mercurial hacking at the University of Toronto, that
weren't related to having a more automated interaction with Kiln.

When, several months ago, we decided to break these changes out again
to try to get them included into core Mercurial, we didn't have a
history of (k)bfiles that was easy to separate from the rest of Kiln.
So we truncated the history, made a new repository so that we could
use kbfiles as a subrepository, and have been hacking on that. Since
Mercurial has not pulled in the history of another repository in a
*long* time (I see it happening once, back in 2006), I didn't see a
problem with this, because I assumed we'd be submitting a single patch
at the end of the day anyway.

One thing missing from the repository right now, which we just
discussed yesterday, is a CONTRIBUTORS file. That should include you,
and the University of Toronto, and the UCOSP program, and Unity 3D,
and Fog Creek, and many others who have contributed. We'll also need
to fix the copyright headers, exactly as you mention in a previous
email. But this was oversight, not malice.

We'll fix the CONTRIBUTORS file today, and Andrew has a patch that he
may have already pushed out that fixes the copyright headers on all of
the files. If you see any we missed, or if you know we've omitted
someone from the contributors file, please let us know. But you are
assuming a level of maliciousness here that I'm somewhat baffled by.

--Benjamin
Na'Tosha Bard
2011-08-04 15:07:30 UTC
Post by Benjamin Pollack
One thing missing from the repository right now, which we just
discussed yesterday, is a CONTRIBUTORS file. That should include you,
and the University of Toronto, and the UCOSP program, and Unity 3D,
I'll just dip my toe in here to mention that our company's name is Unity
Technologies, not Unity 3D (although I am actually guilty of saying Unity 3D
myself). Unity 3D is the actual game development tool we make.

// Me slips quietly back out the door . ..

Cheers,
N.
--
*Na'Tosha Bard*
Build & Infrastructure Developer | Unity Technologies

*E-Mail:* ***@unity3d.com
*Skype:* natosha.bard
Matt Mackall
2011-08-04 15:15:10 UTC
Post by Benjamin Pollack
Since
Mercurial has not pulled in the history of another repository in a
*long* time (I see it happening once, back in 2006), I didn't see a
problem with this, because I assumed we'd be submitting a single patch
at the end of the day anyway.
Yep, this is a correct assumption. Just because it's possible to merge
unrelated repositories doesn't mean it's a great idea and I frankly
regret having done it for hgk. I think it would be better instead to
keep a separate repo for archaeological purposes.

That said, I absolutely DO want to credit everyone who's contributed in
the final merged source.
--
Mathematics is the supreme nostalgia of our time.
Na'Tosha Bard
2011-08-04 15:45:10 UTC
Hi Benjamin,
Post by Benjamin Pollack
We'll fix the CONTRIBUTORS file today, and Andrew has a patch that he
may have already pushed out that fixes the copyright headers on all of
the files.
Thanks for adding this. My employer has paid me a lot of money to work on
this extension and is pleased to be sharing the recognition for the work.

Small note: please correct the name of our company (Unity Technologies) in
the CONTRIBUTORS file.

Regarding the copyright headers, I don't think it's fair to say they are
"Copyrighted by FogCreek". I think technically we all own copyrights on
the parts of the software we wrote, so it should probably say:

"Copyright 2010-2011 Gregory P. Ward, Fog Creek Software, and others (see
CONTRIBUTORS file)"

or similar. For future contributors, you just need to make sure they get
added to the CONTRIBUTORS file.

Cheers,
Na'Tosha
--
*Na'Tosha Bard*
Build & Infrastructure Developer | Unity Technologies

*E-Mail:* ***@unity3d.com
*Skype:* natosha.bard
Andrew Pritchard
2011-08-04 19:21:31 UTC
We've fixed Unity Technologies' name in the CONTRIBUTORS file, and we've added
Greg Ward and Unity Technologies to the copyright headers of files that have
seen their influence. One thing to point out is that the last time we pulled
any of Greg's changes into our fork was in 2010, so that is when the date range
listed on his copyright line ends. The only files whose copyright headers
don't include him are remotestore.py, which was written since our fork
diverged, and proto.py and wirestore.py, which I personally wrote from scratch
in the last month. The license is identical to that specified in Greg's
newly-added copyright headers: GPLv2 or any later version.
Greg Ward
2011-08-07 22:31:32 UTC
Post by Benjamin Pollack
Post by Greg Ward
So that gets me to my first gripe with kbfiles/largefiles, which is
that you (Fog Creek) have almost completely erased the record of my
contribution. It's one thing to fork a project for your own needs, but
it's something else entirely to erase the origins of that code from
the historical record. I have no objection to the fork. I am a bit
unhappy that you have not tried very much to contribute changes back
upstream (i.e. to me). But I am most unhappy that you have nearly
erased me from the history. That's not cool.
I'll let Andrew respond to the rest of this email later, but I felt
that I had to respond to this part.
You're assuming a lot of malice where there is none.
I was pretty sure there was no malice involved, just history
truncation. And I could hardly call myself a version control geek if I
was *not* offended by history truncation.
Post by Benjamin Pollack
Prior to the push that we've been doing over the last few months, our
fork of bfiles was extremely tightly integrated into Kiln.  As such,
it was part of the general Kiln repository at Fog Creek, and we didn't
see a lot of point in submitting a bunch of Kiln-specific changes
upstream to you.
Ahh, OK, now it makes more sense.
Post by Benjamin Pollack
 There weren't honestly many core changes, past the
ones that we did submit on this mailing list as part of our
sponsorship of Mercurial hacking at the University of Toronto,
Thanks again for sponsoring that work!
Post by Benjamin Pollack
When, several months ago, we decided to break these changes out again
to try to get them included into core Mercurial, we didn't have a
history of (k)bfiles that was easy to separate from the rest of Kiln.
So we truncated the history, made a new repository so that we could
use kbfiles as a subrepository, and have been hacking on that.  Since
Mercurial has not pulled in the history of another repository in a
*long* time (I see it happening once, back in 2006), I didn't see a
problem with this, because I assumed we'd be submitting a single patch
at the end of the day anyway.
1) You could probably use 'hg convert' with a filemap to extract the
history of kbfiles from Kiln's history.
2) But maybe that's not worth the bother, since Matt is unlikely to
accept a request to pull the entire history of largefiles.
3) But then Matt said we should keep a separate repo for archaeological
reasons, and IMHO that repo should be bfiles + kbfiles + largefiles
for best accuracy. As long as there are no changesets in the
Kiln-specific history of kbfiles where someone at Fog Creek
accidentally committed nuclear launch codes or the passwords to your
public web server, there's no technical reason not to do that.
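
For point 1, a small sketch of what generating such a filemap might look like. The 'include' and 'rename' directives and the --filemap option belong to the convert extension; the helper names and any concrete paths are placeholders, since the location of kbfiles inside the Kiln repository is not given in this thread.

```python
def build_filemap(subdir, new_root="."):
    """Return filemap text that keeps only subdir and moves it to new_root.

    In a convert filemap, renaming to '.' moves a subdirectory to the
    root of the destination repository.
    """
    lines = [
        "include %s" % subdir,
        "rename %s %s" % (subdir, new_root),
    ]
    return "\n".join(lines) + "\n"


def convert_command(source, dest, filemap_path):
    """Build the argument list for running the conversion."""
    return ["hg", "convert", "--filemap", filemap_path, source, dest]
```

The filemap text would be written to a file and passed to the command, for example via subprocess. Paths containing spaces would need quoting in the filemap, which this sketch does not handle.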
Post by Benjamin Pollack
One thing missing from the repository right now, which we just
discussed yesterday, is a CONTRIBUTORS file.  That should include you,
and the University of Toronto, and the UCOSP program, and Unity 3D,
and Fog Creek, and many others who have contributed.  We'll also need
to fix the copyright headers, exactly as you mention in a previous
email.  But this was oversight, not malice.
The lack of copyright headers is also my fault. I specifically did not
complain about that, because I knew damn well who hadn't gotten around
to adding copyright headers: me!

Also, the CONTRIBUTORS file has a small error: I am *not* the author
of the bigfiles extension. I am the author of bfiles.

BTW, if you guys need some documentation, the bfiles repo has it. I
don't know how much applies to largefiles, but it sounds like a lot of
it.

Greg
Adrian Buehlmann
2011-08-20 06:43:11 UTC
Post by Greg Ward
Post by Benjamin Pollack
Post by Greg Ward
So that gets me to my first gripe with kbfiles/largefiles, which is
that you (Fog Creek) have almost completely erased the record of my
contribution. It's one thing to fork a project for your own needs, but
it's something else entirely to erase the origins of that code from
the historical record. I have no objection to the fork. I am a bit
unhappy that you have not tried very much to contribute changes back
upstream (i.e. to me). But I am most unhappy that you have nearly
erased me from the history. That's not cool.
I'll let Andrew respond to the rest of this email later, but I felt
that I had to respond to this part.
You're assuming a lot of malice where there is none.
I was pretty sure there was no malice involved, just history
truncation. And I could hardly call myself a version control geek if I
was *not* offended by history truncation.
I have to report that Fog Creek just recently did a history truncation
on a separate open source project. But in my view, this time completely
unneeded.

It's on the TortoiseHg project.

On thg, we very recently split off the so-called "shell extension for
Windows" into its own repository at

https://bitbucket.org/tortoisehg/shellext

after Fog Creek had proposed to modify the TortoiseHg shell extension
(on the stable branch, right after the last major release) for the
kbfiles mercurial extension, after which I had proposed to first try to
get kbfiles (now largefiles) into Mercurial's tree (which is happening
here now). [1]

As it happens, Steve managed to carefully keep the history of the
TortoiseHg shell extension by doing a filename-based split conversion
from the TortoiseHg repo.

Instead of at least doing a proper repository fork, Fog Creek completely
started the history from scratch at

https://developers.kilnhg.com/Repo/Kiln/TortoiseHg/Shell-Extension

copying the sources of the original shellext.

For what reason? I really can't see any in this particular case.

In any case it sure makes it harder to backport bugfixes they do in
their fork into the thg shellext (example [2]).

[1] Full disclosure: I wasn't keen on tying the TortoiseHg
shell extension to the intimate details of kbfiles. So it didn't
happen (but I'm open to making it happen for largefiles, when it
makes its way into Mercurial).
[2] https://bitbucket.org/tortoisehg/shellext/issue/3
Benjamin Pollack
2011-08-20 14:20:26 UTC
Post by Adrian Buehlmann
Instead of at least doing a proper repository fork, Fog Creek completely
started the history from scratch at
https://developers.kilnhg.com/Repo/Kiln/TortoiseHg/Shell-Extension
copying the sources of the original shellext.
For what reason? I really can't see any in this particular case.
Neither can I, and this was news to me. I have no idea why we'd have truncated this one; it runs completely counter to my mandate that any patches to TortoiseHg we make that you guys don't accept should be kept easily mergeable in case you change your mind. I suspect David didn't know you could splice history. We'll fix it Monday.

--Benjamin
Adrian Buehlmann
2011-08-20 20:22:24 UTC
Post by Benjamin Pollack
Post by Adrian Buehlmann
Instead of at least doing a proper repository fork, Fog Creek completely
started the history from scratch at
https://developers.kilnhg.com/Repo/Kiln/TortoiseHg/Shell-Extension
copying the sources of the original shellext.
For what reason? I really can't see any in this particular case.
Neither can I, and this was news to me. I have no idea why we'd have truncated this one; it runs completely counter to my mandate that any patches to TortoiseHg we make that you guys don't accept should be kept easily mergeable in case you change your mind. I suspect David didn't know you could splice history. We'll fix it Monday.
Rather off topic, but let me keep the list-cc (and Greg's cc) anyway:

FWIW, your copy of the shellext has already diverged way too far. As
long as I have a say in the TortoiseHg shellext (or I still
care), we're certainly not going to depend on Microsoft's ATL
without a compelling reason [1].

ATL is not included in the gratis express editions of Visual C++ or the
gratis SDK C++ compiler we use for building the shellext and the
mercurial C modules for the *.msi installers. Using ATL clearly locks
out potential shellext contributors who don't want to pay for being able
to compile the sources. So far, they only had to pay for Windows itself.
I'd like to keep that.

What's more, it looks like there has been a considerable amount of other
IMHO rather pointless churn like inserting _T(..) all over the place in
your copy.

So I don't really care about this being merged back into the original as
it is at the moment. I was just baffled to see the history being removed
there too (as in kbfiles). Apparently that wasn't intentional.

While I'm at it, adding a Fog Creek copyright on your copy of the
shellext sources is certainly not incorrect from a legalistic POV, but
on open source projects, I think it's usually only done if there were
significant contributions. IMHO, so far, I haven't seen anything that
qualifies in that regard in your copy of the shellext sources.

[1]
https://developers.kilnhg.com/Repo/Kiln/TortoiseHg/Shell-Extension/History/1d2cde2d26a7
Na'Tosha Bard
2011-08-04 15:38:14 UTC
Permalink
Hi Greg,
<snip>
Actually, your existing kbfiles repository already discards history.
Fog Creek has conveniently collapsed all of *my* work, plus an unknown
amount of forking and hacking, into a single large revision 0. That's
just wrong. It's morally wrong because it deprives the original author
(me) of public credit for his work. And it's technically wrong because
it makes it much harder to trace a given line of code back in history.
So that gets me to my first gripe with kbfiles/largefiles, which is
that you (Fog Creek) have almost completely erased the record of my
contribution. It's one thing to fork a project for your own needs, but
it's something else entirely to erase the origins of that code from
the historical record. I have no objection to the fork. I am a bit
unhappy that you have not tried very much to contribute changes back
upstream (i.e. to me). But I am most unhappy that you have nearly
erased me from the history. That's not cool.
I can understand your frustration, but I think you're probably assuming some
bad intentions here. There are many reasons why people collapse history in
source repositories; I think very rarely is the reason to hide who actually
wrote the code.

I have contributed back various changes that made rebasing work and
fixed merging, reverting, etc. And when I made these contributions,
Fog Creek was more than happy to merge my changes in as-is, preserving
history. I have found Benjamin and his team very accepting and
cooperative. So I completely believe Benjamin's explanation for why the
history is the way it is.
Luckily, it's fixable: start with a clone of bfiles, possibly
truncated at your fork point, alongside your private internal
repository. Apply patches from your internal repo to the bfiles clone.
Then apply patches from the public kbfiles repo. End result: a
legitimate repository that captures the true history of the project,
without erasing anyone's contribution. Final step: rename things into
hgext/largefiles so the whole thing can be pulled into Mercurial.
I doubt Matt will accept the entire history into the Mercurial repository;
usually things are applied for submission as a single patch. So really
there is no need for this. Once it's in, we'll all be working against the
main Mercurial repository anyway.
Finally, I have two *technical* objections: the use of dirstate and
the use of standin files. I know, it's pretty rich for *me* to
criticise kbfiles/largefiles for using my design. But I'm in a pretty
good position to know where I got things wrong.
First, the use of a dedicated dirstate for big files was dumb and lazy
on my part. Big files have a different life-cycle from regular files,
How so? What do you think the typical use case is? I'd like to compare it
to what I see here at Unity.
and trying to shoehorn them into a separate dirstate instance just
doesn't work very well. I think the right thing to do is 1) draw a
diagram of the complete life-cycle of big files, 2) implement a custom
data structure (similar idea to dirstate) that tracks that life-cycle,
3) ditch the current hodge-podge of state-tracking mechanisms. I
haven't got very far on this, since bfiles is now a
weekends-and-evenings (and occasional quiet days at work) project for
me. Anyways, this is fixable by just dedicating some programmer time
to the problem.
Second, I'm not convinced that the fundamental design of bfiles -- the
use of standin files -- is appropriate. It complicates things a lot,
and the main reason I chose it was to allow partial bfupdate -- i.e.
don't make me fetch all of the big files in my working directory; I
just want to fetch some of them. I still think that's a nice feature,
but I wonder if it's worth the complication.
The approach taken by 'snap', where the big file hashes are stored
right in file revlogs, sounds interesting. I peeked at the code for
snap once, and was put off by the sheer volume of code. But it's an
interesting idea.
Alas, I'm not sure this is fixable. There are people out there in the
real world using bfiles and/or kbfiles/largefiles, and changing the
fundamental design would break all of their repositories and bfile
stores. ;-(
You're right; it would break a lot of stuff, but it should be possible to
write a tool to run on largefile repos to "upgrade" to the modern format.
But I'm also not really convinced the result gained is worth the effort
required for all of this.
Oh yeah, for the record, I like the name 'largefiles'.
Me too :-)
--
*Na'Tosha Bard*
Build & Infrastructure Developer | Unity Technologies

*E-Mail:* ***@unity3d.com
*Skype:* natosha.bard
Na'Tosha Bard
2011-08-08 12:05:57 UTC
Permalink
Post by Na'Tosha Bard
Post by Greg Ward
Finally, I have two *technical* objections: the use of dirstate and
the use of standin files. I know, it's pretty rich for *me* to
criticise kbfiles/largefiles for using my design. But I'm in a pretty
good position to know where I got things wrong.
First, the use of a dedicated dirstate for big files was dumb and lazy
on my part. Big files have a different life-cycle from regular files,
How so? What do you think the typical use case is? I'd like to compare it
to what I see here at Unity.
We might be using the term "life-cycle" differently here. I'm not
talking about user-level, I'm talking about the constraints imposed by
the design of bfiles.
Aah yes, I interpreted your statement to mean life-cycle in terms of user
operations.
(Note that I am talking about bfiles here and
*assuming* the same applies to largefiles. I haven't yet done more
than glance at the code, so I don't know how the design of largefiles
has diverged from bfiles.)
Anyways, several months ago I sat down and took a stab at drawing the
state machine for big files. See attached image. I count 13 states for
big files:
1) unknown (just created, not yet bfadd'ed)
2) added (bfadd'ed, not committed)
3) dirty added (bfadd then modify: need bfrefresh to return to state
'added')
4) committed pending (committed, not bfput)
5) missing pending (committed then deleted without bfput)
6) removed pending (committed then bfremove'd without bfput)
7) dirty pending (committed then modified without bfput)
8) modified pending (committed, modified, bfrefresh'ed without bfput)
9) clean (committed and bfput)
10) missing (committed, bfput, deleted)
11) removed (committed, bfput, bfremove'd)
12) dirty (committed, bfput, modified)
13) modified (committed, bfput, modified, bfrefresh'ed)
The use of standins and the need to bfput new big file revs makes the
state machine considerably more complex than what dirstate tries to
track. bfiles works around this by storing more bits of state in
various other places. So far this mostly works, but it's a rich source
of bugs and an unnecessary dependency on Mercurial's internal API.
IMHO bfiles should have its own custom state-tracking mechanism that
tracks the actual life-cycle of big files, not the approximation that
dirstate can be kludged into tracking. It's on my list, but so are a
lot of things.
I don't think this is "substantially" more complicated; it just seems like
we need to keep track of one more variable: whether the bfile has been
uploaded to the central share or not. Maybe bfdirstate does this, but I
don't think so from what I've seen when working on the code myself.
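
A minimal sketch of that "one more variable" view, under the assumption that the 13 states in the list above can be read as a working-copy state paired with an "uploaded" flag. The names COMMITTED_STATES, UNCOMMITTED_STATES, all_states, and label are made up for illustration and are not taken from bfiles or kbfiles:

```python
from itertools import product

# Working-copy states of a big file that has been committed at least once.
COMMITTED_STATES = ["committed", "missing", "removed", "dirty", "modified"]
# States that exist only before the first commit, where "uploaded"
# (i.e. bfput has been run) is not yet meaningful.
UNCOMMITTED_STATES = ["unknown", "added", "dirty added"]


def all_states():
    """Enumerate (working_state, uploaded) pairs.

    uploaded is None before the first commit, otherwise False or True.
    """
    states = [(name, None) for name in UNCOMMITTED_STATES]
    states += list(product(COMMITTED_STATES, [False, True]))
    return states


def label(state):
    """Map a (working_state, uploaded) pair back to the names in the list."""
    working, uploaded = state
    if uploaded is None:
        return working
    if uploaded:
        # "committed and bfput" is called "clean" in the list.
        return "clean" if working == "committed" else working
    return working + " pending"


for state in all_states():
    print(label(state))
```

Read this way, the list is 3 pre-commit states plus 5 working-copy states times 2 values of the flag, which gives 13. Whether a single flag is really enough in practice is exactly the open question here.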

A related comment is that I think there's a bit of added complication:
situations where internal Mercurial code modifies our stand-ins
but leaves the working copy with an incorrect bfile. Cases where we
failed to update the bfiles in the working copy to reflect the
stand-ins have been a source of bugs that I've seen. Specifically, this
is why rebasing did not work for a long time in kbfiles; I think a
similar situation caused a remove/status bug. I've almost wondered if we
need some global way of saying, "If we call into core Mercurial code,
always trust the stand-ins to be correct and update the working copy to
reflect that", but maybe that's not safe either. (Just brainstorming
here.)

Cheers,
Na'Tosha
--
*Na'Tosha Bard*
Build & Infrastructure Developer | Unity Technologies

*E-Mail:* ***@unity3d.com
*Skype:* natosha.bard
Greg Ward
2011-08-07 23:37:55 UTC
Permalink
Post by Greg Ward
Finally, I have two *technical* objections: the use of dirstate and
the use of standin files. I know, it's pretty rich for *me* to
criticise kbfiles/largefiles for using my design. But I'm in a pretty
good position to know where I got things wrong.
First, the use of a dedicated dirstate for big files was dumb and lazy
on my part. Big files have a different life-cycle from regular files,
How so?  What do you think the typical use case is?  I'd like to compare it
to what I see here at Unity.
We might be using the term "life-cycle" differently here. I'm not
talking about user-level, I'm talking about the constraints imposed by
the design of bfiles. (Note that I am talking about bfiles here and
*assuming* the same applies to largefiles. I haven't yet done more
than glance at the code, so I don't know how the design of largefiles
has diverged from bfiles.)

Anyways, several months ago I sat down and took a stab at drawing the
state machine for big files. See attached image. I count 13 states for
big files:

1) unknown (just created, not yet bfadd'ed)
2) added (bfadd'ed, not committed)
3) dirty added (bfadd then modify: need bfrefresh to return to state 'added')
4) committed pending (committed, not bfput)
5) missing pending (committed then deleted without bfput)
6) removed pending (committed then bfremove'd without bfput)
7) dirty pending (committed then modified without bfput)
8) modified pending (committed, modified, bfrefresh'ed without bfput)
9) clean (committed and bfput)
10) missing (committed, bfput, deleted)
11) removed (committed, bfput, bfremove'd)
12) dirty (committed, bfput, modified)
13) modified (committed, bfput, modified, bfrefresh'ed)

The use of standins and the need to bfput new big file revs makes the
state machine considerably more complex than what dirstate tries to
track. bfiles works around this by storing more bits of state in
various other places. So far this mostly works, but it's a rich source
of bugs and an unnecessary dependency on Mercurial's internal API.
IMHO bfiles should have its own custom state-tracking mechanism that
tracks the actual life-cycle of big files, not the approximation that
dirstate can be kludged into tracking. It's on my list, but so are a
lot of things.

Greg
Benjamin Pollack
2011-08-08 18:52:14 UTC
Permalink
1) unknown (just created, not yet bfadd'ed)
2) added (bfadd'ed, not committed)
3) dirty added (bfadd then modify: need bfrefresh to return to state 'added')
4) committed pending (committed, not bfput)
5) missing pending (committed then deleted without bfput)
6) removed pending (committed then bfremove'd without bfput)
7) dirty pending (committed then modified without bfput)
8) modified pending (committed, modified, bfrefresh'ed without bfput)
9) clean (committed and bfput)
10) missing (committed, bfput, deleted)
11) removed (committed, bfput, bfremove'd)
12) dirty (committed, bfput, modified)
13) modified (committed, bfput, modified, bfrefresh'ed)
I agree; that's a very complicated state diagram. But largefiles doesn't work that way.

All of the bf* commands from bfiles are dead. largefiles automatically manages the equivalent operations. We submitted patches to enable this for bfiles a year or two ago, but they are optional for bfiles. They have always been mandatory for Kiln.

Axing all the bf* commands *dramatically* simplifies the state machine. In fact, if you make this all automatic, the user-visible lifecycle for largefiles ends up being the same as for any file in Mercurial: it's unknown, added, missing, removed, or modified. When you commit, we copy it to the repository and global caches to allow reverting and the like. Whenever you push to a Mercurial repository, any missing largefiles are uploaded before the bundle is sent; the server rejects the bundle if the corresponding largefiles haven't yet been uploaded. Because kbfiles and largefiles have always worked this way, determining the missing largefiles is as easy as walking the manifests of the changesets you're about to push.
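
A rough sketch of the push flow described above, not the actual largefiles code. It assumes each outgoing changeset can be modelled as a dict mapping paths to file contents, that standins live under ".hglf/", and that a standin's content is the SHA1 of its largefile. The function names and the dict-based stores are hypothetical; real manifests map paths to file nodes rather than contents.

```python
STANDIN_PREFIX = ".hglf/"  # standin directory discussed in this thread


def largefile_hashes(outgoing, prefix=STANDIN_PREFIX):
    """Collect the hashes named by standin files in the outgoing changesets."""
    hashes = set()
    for manifest in outgoing:
        for path, content in manifest.items():
            if path.startswith(prefix):
                hashes.add(content.strip())
    return hashes


def missing_on_server(outgoing, server_store):
    """Hashes referenced by the outgoing changesets that the server lacks."""
    return largefile_hashes(outgoing) - set(server_store)


def push(outgoing, local_store, server_store):
    """Upload missing largefiles first, then let the 'server' check the bundle."""
    for sha in sorted(missing_on_server(outgoing, server_store)):
        server_store[sha] = local_store[sha]
    # Server-side check: reject if any referenced largefile is still absent.
    if missing_on_server(outgoing, server_store):
        raise ValueError("push rejected: largefiles not uploaded")
    return "bundle accepted"


outgoing = [
    {"src/main.c": "int main(void) { return 0; }\n",
     ".hglf/assets/a.png": "aaa111\n"},
    {".hglf/assets/a.png": "aaa111\n",
     ".hglf/assets/b.png": "bbb222\n"},
]
local_store = {"aaa111": b"png data a", "bbb222": b"png data b"}
server_store = {"aaa111": b"png data a"}

to_upload = missing_on_server(outgoing, server_store)
result = push(outgoing, local_store, server_store)
print(sorted(to_upload), result)
```

The point of the sketch is only that no extra bookkeeping is needed on the client: the set of largefiles to upload falls out of the changesets being pushed.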

largefiles does currently maintain its own dirstate, but that's just a legacy that I had thus far found convenient to preserve for debugging. There's no technical reason largefiles couldn't simply alter the state of Mercurial's dirstate file directly. If there's agreement that's the right direction, I can't think of any technical reason we couldn't ditch the separate dirstate. Everyone would just have to understand that the physical dirstate "lies" with largefiles enabled.

Incidentally, this transparent operation was the goal of kbfiles, and is the goal of largefiles: working with largefiles should involve nothing more than making sure all relevant Mercurials have largefiles enabled and marking that you want a given file as managed by largefiles. Really, the only user-visible difference in largefiles should be that "hg update" sometimes requires network access to fetch missing largefiles. That's it. With largefiles' support for "hg serve" out-of-the-box, I think we're nearly there.

--Benjamin
Andrew Pritchard
2011-08-10 20:00:32 UTC
Permalink
After a lot of refactoring and bugfixing, as well as plenty of naming
and copyright concerns, it now seems that largefiles is nearly ready
to be added to the Mercurial core repository as a bundled extension.
At this point, if no one has any objections, comments, or concerns, I
can collapse largefiles into a patch against Mercurial (still
preserving the original repositories on http://developers.kilnhg.com),
place it in a clone of http://selenic.com/hg, and submit it as a pull
request (because a 7500-line patchbomb seems ill-advised).
Chris Cannam
2011-08-10 20:26:48 UTC
Permalink
Post by Andrew Pritchard
After a lot of refactoring and bugfixing, as well as plenty of naming
and copyright concerns, it now seems that largefiles is nearly ready
to be added to the Mercurial core repository as a bundled extension.
Is there a short introduction anywhere from which we can learn how to
test this? If not, can you summarise here?

I'm very interested in using this extension, but the documentation
doesn't seem quite enough even for a sympathetic tester like me -- for
example, the help text points at a usage.txt file that doesn't exist
and gives a canonical URL for it that doesn't resolve; "help
lfconvert" points to the lfput command which also appears not to
exist; and design.txt (the only apparent documentation outside the
code) still refers to bfiles throughout.

(I've never used bfiles either, for what it's worth -- I don't know
whether that's likely to be good or bad in learning about this one.)


Chris
Andrew Pritchard
2011-08-10 22:05:09 UTC
Permalink
Post by Chris Cannam
Is there a short introduction anywhere from which we can learn how to
test this?  If not, can you summarise here?
As of about five minutes ago, there's a short introduction at
https://developers.kilnhg.com/Repo/Kiln/largefiles/largefiles/File/usage.txt.
The gist of it is that, for new large files and new repositories, all
you have to do is enable the largefiles extension on all clients and
servers and use 'hg add --large'. Old repositories can be converted
with 'hg lfconvert'.
Post by Chris Cannam
the help text points at a usage.txt file that doesn't exist
and gives a canonical URL for it that doesn't resolve;
Just fixed that as well; it now points to the URL I just gave you.
Post by Chris Cannam
"help lfconvert" points to the lfput command which also appears not to
exist;
I hadn't noticed that; I'll look into it now.
Post by Chris Cannam
and design.txt (the only apparent documentation outside the
code) still refers to bfiles throughout.
That's also fixed as of five minutes ago.
Post by Chris Cannam
(I've never used bfiles either, for what it's worth -- I don't know
whether that's likely to be good or bad in learning about this one.)
That's probably a good thing: largefiles aims to be largely (ha!)
transparent to the user, other than 'add --large' to force files to be
tracked as largefiles and requiring network access for updates to new
revisions, whereas bfiles is very hands-on with bfile management,
requiring the user to think of them as an entirely separate kind of
entity from normal tracked files.
Greg Ward
2011-08-11 02:03:46 UTC
Permalink
Post by Andrew Pritchard
After a lot of refactoring and bugfixing, as well as plenty of naming
and copyright concerns, it now seems that largefiles is nearly ready
to be added to the Mercurial core repository as a bundled extension.
Awesome! I should mention that I am quite happy to see this happen,
and look forward to the day when I can migrate from bfiles to
largefiles at work. So I'm finally looking at the code, running the
tests, etc.

Various concerns...

1) I strongly believe you need some better docs: just trying to piece this
thing together from 'hg help' is not enough, which is why I wrote usage.txt
for bfiles. Please steal that and hack it up so it describes largefiles.
Your post to this list that started this thread a week or two ago was also
excellent: I suggest you crib from it liberally.

2) test-lockout.t is failing. I tried Mercurial 1.7, 1.8, and 1.9 -- different
failures, but it failed with all of them. Let me know if you cannot reproduce
and I'll give more detail.

3) The copyright attribution is inaccurate: bfiles was written as a work
for hire, and so the first copyright line should be my employer, just as I
put in the copyright statements for bfiles a few days ago (see
http://hg.gerg.ca/hg-bfiles/rev/6f832a089582). I took the liberty of
granting some copyright to myself on the grounds that I have spent a lot of
my own time on bfiles over the last year or so. If my boss disagrees and
decides to give me a hard time over that, I'll let you know. ;-)

4) I see you made liberal use of my hgtest.py module. Too bad: I'm pretty sure
Matt won't like that. The best thing about hgtest.py is that it offended Matt
so deeply that he implemented the fine new unified test system in
Mercurial 1.7. The worst thing about it is that converting tests from
hgtest.py to unified is a slow, painful, tedious manual process. I've
converted many of the tests I wrote for bfiles, but not all.

5) Has anyone reviewed the changes to bfiles since you forked to see if
there's anything there that needs to be in largefiles? I guess I'll do it
if no one else has, but a) I don't know the fork point and b) I don't want
to duplicate the work if it has already been done.

6) It would be nice to make migration from bfiles painless and transparent.
My first suggestion: make the '.hglf' prefix configurable. Then bfiles
users can just set it to .hgbfiles and not have to go through a painful
repository conversion just to remap the standin filenames.

(Hey, has anyone else noticed that '.hglf' looks like 'mercurial linefeed'?
I hope this doesn't get confused with eol...)
Post by Andrew Pritchard
At this point, if no one has any objections, comments, or concerns, I
can collapse largefiles into a patch against Mercurial (still
preserving the original repositories on http://developers.kilnhg.com),
place it in a clone of http://selenic.com/hg, and submit it as a pull
request (because a 7500-line patchbomb seems ill-advised).
I still very much want to see a repo that *accurately* preserves the
history of bfiles + kbfiles + largefiles. That is, I think you should:

1) start with a partial clone of bfiles, up to the changeset that Fog Creek
forked it to create kbfiles
2) import patches from the Kiln repo that trace the history of kbfiles
3) import patches from the largefiles repo that Benjamin created on June 20

The resulting repo will give an accurate history of largefiles, and
IMO that is what should be saved for posterity in a prominent
location.

If you need any help doing that, just ask. I'm motivated to get it
done, but I don't have access to the Kiln repo. All I can do is
suggest ways to get the job done.

Greg
Matt Mackall
2011-08-11 17:26:10 UTC
Permalink
Post by Greg Ward
6) It would be nice to make migration from bfiles painless and transparent.
My first suggestion: make the '.hglf' prefix configurable. Then bfiles
users can just set it to .hgbfiles and not have to go through a painful
repository conversion just to remap the standin filenames.
I'd rather not have a config option if we can avoid it. Perhaps an
automatic fallback?
Post by Greg Ward
(Hey, has anyone else noticed that '.hglf' looks like 'mercurial linefeed'?
I hope this doesn't get confused with eol...)
Yeah, that's a bit odd.
--
Mathematics is the supreme nostalgia of our time.
Arne Babenhauserheide
2011-08-13 10:45:51 UTC
Permalink
Post by Matt Mackall
Post by Greg Ward
(Hey, has anyone else noticed that '.hglf' looks like 'mercurial
linefeed'? I hope this doesn't get confused with eol...)
Yeah, that's a bit odd.
I thought the same.

Why not just .hglarge?

Best wishes,
Arne
Andrew Pritchard
2011-08-14 04:53:00 UTC
Permalink
That was actually a conscious decision to make the path as short as
possible to avoid taking up valuable characters on filesystems whose
path lengths are limited. (This can also help avoid forcing the
revlog filenames into the hashed form on fncache repositories,
assuming I understand fncache correctly).

Ultimately, the standin files aren't particularly user-visible, in
that every command is wrapped to hide their existence from the user,
so I took the practical side of the tradeoff over the aesthetics of
'ls -a'.
Andrew Pritchard
2011-08-11 17:29:38 UTC
Permalink
Post by Greg Ward
1) I strongly believe you need some better docs: just trying to piece this
  thing together from 'hg help' is not enough, which is why I wrote usage.txt
  for bfiles. Please steal that and hack it up so it describes largefiles.
  Your post to this list that started this thread a week or two ago was also
  excellent: I suggest you crib from it liberally.
As of a few hours before you sent this message, I added a usage.txt
and updated a lot of the documentation. I'm not aware of any
documentation that is still missing or wrong, so if you notice any,
mention it.
Post by Greg Ward
2) test-lockout.t is failing. I tried Mercurial 1.7, 1.8, and 1.9 -- different
  failures, but it failed with all of them. Let me know if you cannot reproduce
  and I'll give more detail.
Without seeing the failure, I'm going to guess that it is limited to
output styling of the "this repository uses largefiles" message when
interacting with a server. If this is the case, it is because
you're using hg-stable or a revision of Mercurial before f4522df38c65,
which added support for explicitly handling and formatting errors
in remote repositories.
Post by Greg Ward
3) The copyright attribution is inaccurate
Fixed
Post by Greg Ward
4) I see you made liberal use of my hgtest.py module. Too bad: I'm pretty sure
  Matt won't like that. The best thing about hgtest.py is that it offended Matt
  so deeply that he implemented the fine new unified test system in
  Mercurial 1.7. The worst thing about it is that converting tests from
  hgtest.py to unified is a slow, painful, tedious manual process. I've
  converted many of the tests I wrote for bfiles, but not all.
I'm not really sure what to do about this; there are a lot of tests,
and I agree that converting them all would be painful. Still, maybe
it's necessary.
Post by Greg Ward
5) Has anyone reviewed the changes to bfiles since you forked to see
if there's anything there that needs to be in largefiles?
The first commit after the common ancestor is
http://hg.gerg.ca/hg-bfiles/rev/069cdf479b30. I'll sift through them
if you don't get to it first.
Post by Greg Ward
6) It would be nice to make migration from bfiles painless and transparent.
  My first suggestion: make the '.hglf' prefix configurable. Then bfiles
  users can just set it to .hgbfiles and not have to go through a painful
  repository conversion just to remap the standin filenames.
The problem with a configurable prefix is that it makes a huge mess
out of interaction between clients and servers, in that clients and
servers would both have to agree on the prefix; if they didn't, all
sorts of crazy and confusing things could happen. Specifically, Kiln
likes to refuse pushes that reference largefiles it doesn't know about
(in a future where largefiles is bundled with Mercurial, lots of other
hosting providers will like to as well, and in a few changesets hg serve
probably will too). If the prefix is mismatched between
client and server, it would be possible to push changes without their
corresponding largefiles, or even worse the server could reject
innocent pushes that happen to have the server's configured prefix.
And, even more troubling for hosting providers, their clients may each
use their own prefixes, giving them no way to figure out what the
prefix is. If you really want to use your bfiles repositories as-is,
it's easy to write a several-line extension that just sets
lfutil.shortname in extsetup(ui), but I'm planning on adding a
conversion script anyway (I think it's likely to be nothing more than
a filemap for the convert extension).
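
To make the two failure modes above concrete, here is a toy model, not Kiln's or Mercurial's actual check. It assumes the server only inspects paths under its own configured prefix and treats the content of each such file as the hash of a required largefile. The function server_accepts and the dict-based manifest are hypothetical.

```python
def server_accepts(manifest, server_prefix, server_store):
    """Toy server-side check.

    Every file under server_prefix is treated as a standin whose content
    names a largefile that must already be in server_store.
    """
    for path, content in manifest.items():
        if path.startswith(server_prefix) and content.strip() not in server_store:
            return False
    return True


server_store = {}  # the server has no largefiles at all

# Failure mode 1: the client is configured with ".hgbfiles/", the server
# with ".hglf/". The server sees no standins, so the push is accepted even
# though the largefile behind the standin was never uploaded.
client_push = {".hgbfiles/big.bin": "abc123\n", "README": "hello\n"}
accepted_without_blob = server_accepts(client_push, ".hglf/", server_store)

# Failure mode 2: an ordinary repository happens to track a normal file
# under the server's configured prefix. The server misreads it as a
# standin for a largefile it does not have and rejects the push.
innocent_push = {".hglf/notes.txt": "just some notes\n"}
innocent_accepted = server_accepts(innocent_push, ".hglf/", server_store)

print(accepted_without_blob, innocent_accepted)
```

With a single fixed prefix, the first case cannot arise, and the second is limited to one well-known directory name.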
Post by Greg Ward
I still very much want to see a repo that *accurately* preserves the
history of bfiles + kbfiles + largefiles.
I can work on extracting the commits in the Kiln extensions repo that
touch the bfiles code and dump them into a repository on
http://developers.kilnhg.com. For the moment, actually combining
these into a single repo is not at the top of my to-do list, but at
least all the history will be available.
Andrew Pritchard
2011-08-11 22:05:36 UTC
Permalink
Look now and be happy! The repositories on
http://developers.kilnhg.com have been replaced with versions I
spliced together from the three repositories. They should contain the
full history, with two exceptions:

- A medium chunk of code inexplicably changed between the latest code in
  our private repositories and the first commit in the truncated
  largefiles repositories; I folded that into that commit under the
  assumption that it was changed locally before committing.
- One two-commit branch merge's second parent was lost, but all of the
  code it touched has since been replaced, so it should have no effect
  now.

'hg churn' also over-reports my change counts by about 14,000 lines
because I manually moved all the files out of a subdirectory into the
repository root to get things ready for the splice, but everyone else's
changes are properly attributed to their original committer.
Greg Ward
2011-08-14 20:24:33 UTC
Permalink
[me]
Post by Andrew Pritchard
Post by Greg Ward
6) It would be nice to make migration from bfiles painless and transparent.
  My first suggestion: make the '.hglf' prefix configurable. Then bfiles
  users can just set it to .hgbfiles and not have to go through a painful
  repository conversion just to remap the standin filenames.
[Andrew Pritchard]
Post by Andrew Pritchard
The problem with a configurable prefix is that it makes a huge mess
out of interaction between clients and servers, in that clients and
servers would both have to agree on the prefix; if they didn't, all
sorts of crazy and confusing things could happen.
Eek. So I'd have to make damn sure to configure largefiles the same on
all clients and servers. Luckily, I have the ability to do that! (I
wrote an extension that gives me central control over all of my
developers' .hg/hgrc files. I try not to let the power go to my head.)

(For "I" read "anyone wanting to migrate transparently from bfiles to
largefiles".)

Idea: how hard would it be to add the standin prefix dir to the wire
protocol? It should only have to be sent once per conversation.
Post by Andrew Pritchard
 If you really want to use your bfiles repositories as-is,
it's easy to write a several-line extension that just sets
lfutil.shortname in extsetup(ui)
True enough. But then I have to make sure that extension is installed
and enabled on all client and server machines, which is approximately
as difficult as making sure largefiles is configured the same
everywhere. To me, it seems easier to make largefiles configurable
(or, as MPM suggested, adaptable).
Post by Andrew Pritchard
but I'm planning on adding a
conversion script anyway (I think it's likely to be nothing more than
a filemap for the convert extension).
Yeah, that should be easy. *But* it's not a transparent migration,
since all of our changeset IDs would change. We have thousands of
Bugzilla slips that link to changesets in hgweb. Who's going to
migrate them?

Greg
Andrew Pritchard
2011-08-14 22:24:21 UTC
Permalink
Post by Greg Ward
Idea: how hard would it be to add the standin prefix dir to the wire
protocol? It should only have to be sent once per conversation.
That would actually not be particularly difficult, and it could easily
be stored in '.hg/largefiles/standins' or the like. The
largefiles-kiln branch already has code for dealing with per-repo
prefixes, to make kbfiles repositories continue to work, but it
currently just checks for the existence of ".hg/store/data/.hglf/*.i"
or ".kbf/*.i". This would still make hosting providers' jobs a bit
harder, since they would have to keep track of the prefix in order to
handle largefiles separately.
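
The existence check mentioned above might look roughly like the following. This is a sketch, not the code from the largefiles-kiln branch. It assumes the ".kbf" pattern is also relative to ".hg/store/data", and the names CANDIDATE_PREFIXES and detect_standin_prefix are made up for illustration.

```python
import glob
import os
import tempfile

# Prefixes to probe, in order of preference (assumed ordering).
CANDIDATE_PREFIXES = [".hglf", ".kbf"]


def detect_standin_prefix(repo_root):
    """Return the first prefix with revlog index files in the store, or None."""
    data_dir = os.path.join(repo_root, ".hg", "store", "data")
    for prefix in CANDIDATE_PREFIXES:
        # Mirrors the "<prefix>/*.i" pattern from the message above, so only
        # standins directly inside the prefix directory are noticed.
        if glob.glob(os.path.join(data_dir, prefix, "*.i")):
            return prefix
    return None


# Small demonstration on a fake repository layout.
with tempfile.TemporaryDirectory() as root:
    standins = os.path.join(root, ".hg", "store", "data", ".kbf")
    os.makedirs(standins)
    open(os.path.join(standins, "big.bin.i"), "w").close()
    detected = detect_standin_prefix(root)

print(detected)
```

A repository with no standins yet returns None here, which is the same gap noted in this thread for a push that adds the first largefile.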

Just brainstorming here, but currently the value of the 'lfilestore'
capability carries no meaning: I set it arbitrarily to 'serve' to
leave it open to future versions of the protocol or alternative ways
of providing a store, but it could be used as the prefix, as in
'lfilestore=.hglf'. This would be a bit weird in that the
capabilities reported by the server could vary per repository, and it
also doesn't solve the problem of identifying the prefix when pushing
largefiles changesets to a repository that currently has none. Adding
a wireproto command for that single case seems a bit excessive.

Abandoning that line of thought almost entirely, it could use the
pushkey mechanism, adding no new wireproto commands and even taking
advantage of the warn-on-already-existing key behavior in the case of
a repository that already has largefiles.

There could be some weirdness if two developers simultaneously add
largefiles to a previously-vanilla repository and try to push them,
but as long as new largefiles repositories always use a specific
prefix, that shouldn't cause any problems.
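
A toy model of that idea, assuming pushkey behaves like a compare-and-set on a (namespace, key) pair and that an empty old value means "not set yet". The class ToyPushkeyStore, the function announce_prefix, and the "largefiles"/"prefix" namespace and key are all hypothetical; this is not Mercurial's implementation.

```python
class ToyPushkeyStore:
    """In-memory stand-in for a server's pushkey namespaces."""

    def __init__(self):
        self._keys = {}

    def pushkey(self, namespace, key, old, new):
        """Set key to new only if its current value equals old."""
        current = self._keys.get((namespace, key), "")
        if current != old:
            return False
        self._keys[(namespace, key)] = new
        return True

    def listkeys(self, namespace):
        """Return all keys and values in the given namespace."""
        return {k: v for (ns, k), v in self._keys.items() if ns == namespace}


def announce_prefix(store, prefix):
    """Try to record the standin prefix; report how it went."""
    if store.pushkey("largefiles", "prefix", "", prefix):
        return "set"
    existing = store.listkeys("largefiles").get("prefix")
    return "already set" if existing == prefix else "conflict"


server = ToyPushkeyStore()
first = announce_prefix(server, ".hglf")       # first developer to push
second = announce_prefix(server, ".hglf")      # simultaneous second developer
third = announce_prefix(server, ".hgbfiles")   # someone with another prefix
print(first, second, third)
```

In this model the second developer loses the race but finds the same prefix already recorded, so nothing goes wrong; only a differing prefix produces a conflict that needs a warning.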

Thoughts?
Greg Ward
2011-08-15 00:06:07 UTC
Permalink
Post by Andrew Pritchard
Post by Greg Ward
Idea: how hard would it be to add the standin prefix dir to the wire
protocol? It should only have to be sent once per conversation.
That would actually not be particularly difficult, and it could easily
be stored in '.hg/largefiles/standins' or the like.  The
largefiles-kiln branch already has code for dealing with per-repo
prefixes, to make kbfiles repositories continue to work, but it
currently just checks for the existence of ".hg/store/data/.hglf/*.i"
or ".kbf/*.i".
So users of kbfiles have the same problem as users of bfiles? I.e.
switching to largefiles would be easy if largefiles transparently
recognized the old standin dir, but disruptive if we force them to
convert with a filemap?
Post by Andrew Pritchard
 This would still make hosting providers' jobs a bit
harder, since they would have to keep track of the prefix in order to
handle largefiles separately.
If they choose to. IMHO if a hosting provider chooses to say "we
support largefiles, but *only* with the canonical default .hglf [or
whatever we settle on] prefix", that's entirely reasonable. bfiles and
kbfiles should be viewed as prototypes.
Post by Andrew Pritchard
Just brainstorming here, but currently the value of the 'lfilestore'
capability carries no meaning: I set it arbitrarily to 'serve' to
leave it open to future versions of the protocol or alternative ways
of providing a store, but it could be used as the prefix, as in
'lfilestore=.hglf'.  This would be a bit weird in that the
capabilities reported by the server could vary per repository, and it
also doesn't solve the problem of identifying the prefix when pushing
largefiles changesets to a repository that currently has none.
That does seem weird.
Post by Andrew Pritchard
Abandoning that line of thought almost entirely, it could use the
pushkey mechanism, adding no new wireproto commands and even taking
advantage of the warn-on-already-existing key behavior in the case of
a repository that already has largefiles.
Yeah, that sounds sensible.
Post by Andrew Pritchard
There could be some weirdness if two developers simultaneously add
largefiles to a previously-vanilla repository and try to push them,
but as long as new largefiles repositories always use a specific
prefix, that shouldn't cause any problems.
I don't think we should expose any way to use a non-default standin
prefix. .hgbfiles/ and .hgkbf/ are legacy prefixes that we should
support for the convenience of users, but there is no reason to create
a new largefiles repo using those prefixes. One ring to rule them all,
and all that.

Greg
Na'Tosha Bard
2011-09-22 14:19:03 UTC
Permalink
So, to pick this topic up again, can we get an open punchlist of things that
the Mercurial community (and project leader) believe are "missing" from the
largefiles extension? I.e., what is missing for it to be accepted into
Mercurial?

The main repository is living here:
https://developers.kilnhg.com/Repo/Kiln/largefiles/largefiles

(there's also a branch with some compatibility stuff that's useful for Kiln
users, but that is not so relevant here).

Cheers,
Na'Tosha
--
*Na'Tosha Bard*
Build & Infrastructure Developer | Unity Technologies

*E-Mail:* ***@unity3d.com
*Skype:* natosha.bard
Martin Geisler
2011-09-22 16:37:48 UTC
Permalink
Post by Na'Tosha Bard
So, to pick this topic up again, can we get an open punchlist of
things that the mercurial community (and project leader) believes is
"missing" for the largefiles extension? E.g, what is missing for it to
be accepted into mercurial?
I guess you'll have to patchbomb it here eventually. Also, you could
describe the features in a mail here -- I found a usage.txt file in the
repository which seems relevant:

Largefiles allows for tracking large, incompressible binary files in
Mercurial without requiring excessive bandwidth for clones and pulls.
Files added as largefiles are not tracked directly by Mercurial;
rather, their revisions are identified by a checksum, and Mercurial
tracks these checksums. This way, when you clone a repository or pull
in changesets, the large files in older revisions of the repository
are not needed, and only the ones needed to update to the current
version are downloaded. This saves both disk space and bandwidth.

If you are starting a new repository or adding new large binary files,
using largefiles for them is as easy as adding '--large' to your hg
add command. For example:

$ dd if=/dev/urandom of=thisfileislarge count=2000
$ hg add --large thisfileislarge
$ hg commit -m 'add thisfileislarge, which is large, as a largefile'

When you push a changeset that affects largefiles to a remote
repository, its largefile revisions will be uploaded along with it.
Note that the remote Mercurial must also have the largefiles extension
enabled for this to work.

When you pull a changeset that affects largefiles from a remote
repository, nothing different from Mercurial's normal behavior
happens. However, when you update to such a revision, any largefiles
needed by that revision are downloaded and cached if they have never
been downloaded before. This means that network access is required to
update to a revision you have not yet updated to.

If you already have large files tracked by Mercurial without the
largefiles extension, you will need to convert your repository in
order to benefit from largefiles. This is done with the 'hg lfconvert'
command:

$ hg lfconvert --size 10 oldrepo newrepo

By default, in repositories that already have largefiles in them, any
new file over 10MB will automatically be added as a largefile. To
change this threshold, set [largefiles].size in your Mercurial config
file to the minimum size in megabytes to track as a largefile, or use
the --lfsize option to the add command (also in megabytes):

[largefiles]
size = 2

$ hg add --lfsize 2

The [largefiles].patterns config option allows you to specify specific
space-separated filename patterns (in shell glob syntax) that should
always be tracked as largefiles:

[largefiles]
patterns = *.jpg *.{png,bmp} library.zip content/audio/*
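
As a rough illustration of how the size threshold and the patterns could combine, here is a minimal Python sketch. It is not the extension's code: `is_largefile` is a hypothetical helper, matching is done with the standard `fnmatch` module (which does not understand `{png,bmp}` brace groups, so the patterns are spelled out), and treating the threshold as inclusive is an assumption.

```python
import fnmatch

MEGABYTE = 1024 * 1024


def is_largefile(path, size_bytes, min_size_mb=10, patterns=()):
    """Return True if a newly added file should be tracked as a largefile."""
    # A matching pattern forces largefile tracking regardless of size.
    if any(fnmatch.fnmatch(path, pat) for pat in patterns):
        return True
    # Otherwise fall back to the size threshold (assumed inclusive here).
    return size_bytes >= min_size_mb * MEGABYTE


patterns = ["*.jpg", "*.png", "*.bmp", "library.zip", "content/audio/*"]
small_image = is_largefile("art/logo.jpg", 4096, patterns=patterns)
small_source = is_largefile("src/main.c", 4096, patterns=patterns)
big_blob = is_largefile("dump.bin", 3 * MEGABYTE, min_size_mb=2)
print(small_image, small_source, big_blob)
```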

I tried cloning the largefiles repo into the hgext folder in Mercurial
and ran

% pyflakes hgext/largefiles/*.py
hgext/largefiles/basestore.py:15: 'shutil' imported but unused
hgext/largefiles/basestore.py:17: 'error' imported but unused
hgext/largefiles/basestore.py:17: 'url_' imported but unused
hgext/largefiles/lfutil.py:39: redefinition of function 'dirstate_walk' from line 35
hgext/largefiles/localstore.py:57: undefined name 'err'
hgext/largefiles/overrides.py:13: 're' imported but unused
hgext/largefiles/overrides.py:28: 'proto' imported but unused
hgext/largefiles/overrides.py:611: local variable 'dest' is assigned to but never used
hgext/largefiles/overrides.py:662: redefinition of function 'write' from line 647
hgext/largefiles/proto.py:7: 'shutil' imported but unused
hgext/largefiles/proto.py:109: undefined name 'l'
hgext/largefiles/proto.py:126: undefined name 'capabilities_orig'
hgext/largefiles/proto.py:155: undefined name 'ssh_oldcallstream'
hgext/largefiles/proto.py:162: undefined name 'http_oldcallstream'
hgext/largefiles/remotestore.py:57: undefined name 'HTTPError'
hgext/largefiles/remotestore.py:61: undefined name 'urllib2'
hgext/largefiles/remotestore.py:86: local variable 'expect_hash' is assigned to but never used
hgext/largefiles/remotestore.py:95: undefined name 'store_path'
hgext/largefiles/remotestore.py:100: undefined name 'store_path'
hgext/largefiles/reposetup.py:15: 'httprepo' imported but unused
hgext/largefiles/reposetup.py:34: undefined name '_'
hgext/largefiles/reposetup.py:224: redefinition of unused 'node' from line 15

You should look into those errors.
--
Martin Geisler

Mercurial links: http://mercurial.ch/
Na'Tosha Bard
2011-09-23 11:39:32 UTC
Permalink
Hello,
Post by Martin Geisler
[...]
I tried cloning the largefiles repo into the hgext folder in Mercurial
and ran
% pyflakes hgext/largefiles/*.py
[...]
You should look into those errors.
I will certainly look into these errors and get them fixed. I've contacted
FogCreek off-list about what their plan and schedule is so we can try to
work together and get a patchbomb sent here for review. If they don't have
time, I will take up the big push if necessary.

I really believe this extension is extremely important for Mercurial. It
opens the door to a new group of users who would otherwise not be able to
use distributed version control at all, and I have not found any solution
for this problem in other DVCS systems like Bazaar and Git.

Largefiles has really come a long way as well, with the help of a lot of
people. The only things end-users have to do to use largefiles are:

1) Turn it on (the same as any extension)
2) If they don't set up their .hgrc to automatically add files over a
certain size, use the "--large" flag when adding a new largefile.

Everything else is completely automatic; we don't even think about
largefiles because they are taken care of by standard Mercurial
operations. It really is a breeze to use.

One thing I know we're missing is a concrete set of instructions somewhere
for how to turn them on server-side. I think it's as simple as enabling
largefiles on the server and running hg serve, but I'm not 100% sure. I'm
looking into that as well: I will test it on my end and then make sure it's
documented.

Please let me know if there are other obvious issues that would need
to be addressed.

Cheers,
Na'Tosha
Na'Tosha Bard
2011-09-23 15:47:16 UTC
Permalink
Post by Na'Tosha Bard
Hello,
Post by Martin Geisler
[...]
I tried cloning the largefiles repo into the hgext folder in Mercurial
and ran
% pyflakes hgext/largefiles/*.py
[...]
You should look into those errors.
I will certainly look into these errors and get them fixed.
FYI, this is done now in Unity Technologies' copy of the repository:
http://fogbugz.unity3d.com/kiln/Repo/Mercurial/Group/largefiles

I have requested that FogCreek pull the changes into the "main" repo on
developers.kilnhg.com.

Cheers,
Na'Tosha
Matt Mackall
2011-09-23 22:18:39 UTC
Permalink
Post by Na'Tosha Bard
Hello,
Post by Martin Geisler
[...]
I tried cloning the largefiles repo into the hgext folder in Mercurial
and ran
% pyflakes hgext/largefiles/*.py
[...]
You should look into those errors.
I will certainly look into these errors and get them fixed. I've contacted
FogCreek off-list about what their plan and schedule is so we can try to
work together and get a patchbomb sent here for review. If they don't have
time, I will take up the big push if necessary.
At some point, I'm going to need to see patches start showing up here so
that I can review or apply them.
Post by Na'Tosha Bard
I really believe this extension is extremely important for Mercurial.
I don't think any more pitching is necessary, I'm totally on board for
including this and I think everyone else is excited about it too. It's
just down to the details of the actual implementation at this point.

But we are running out of time for 2.0. I'd like to see some code show
up on the list before Oct 1 if at all possible so that we can get all
the back and forth done by the code freeze date.

(FYI, I'm not going to merge the project DAGs by doing a pull.)
--
Mathematics is the supreme nostalgia of our time.
Na'Tosha Bard
2011-09-24 15:54:13 UTC
Permalink
Post by Matt Mackall
But we are running out of time for 2.0. I'd like to see some code show
up on the list before Oct 1 if at all possible so that we can get all
the back and forth done by the code freeze date.
(FYI, I'm not going to merge the project DAGs by doing a pull.)
I just sent a patchbomb to the list, but I'm happy to break it into smaller
patches if that helps. I'm not sure what the "desired" amount of noise is
(I assume less is better, but the two patches are both really huge).

The entire repository can be found at
http://fogbugz.unity3d.com/kiln/Repo/Mercurial/Group/largefiles if that's
easier for anyone.
Greg Ward
2011-09-28 22:04:23 UTC
Permalink
Post by Na'Tosha Bard
So, to pick this topic up again, can we get an open punchlist of things that
the mercurial community (and project leader) believes is "missing" for the
largefiles extension? E.g, what is missing for it to be accepted into
mercurial?
I have two requests:

* [essential] make it possible for largefiles to seamlessly use repos
where the standins live in .hgbfiles, so that existing users of
bfiles don't have to go through a painful conversion process

* [wishlist] make it possible for me to turn off largefiles in my
working dir, so I'm not bothered by stupid large files that just
annoy me, waste disk space, waste time, waste bandwidth, etc.

I'm willing to do the work for both of these, but I'd rather do it by
sending patches after largefiles has been added to hgext. That way I
get the credit. ;-)

Greg
Benjamin Pollack
2011-10-01 23:57:25 UTC
Permalink
Post by Greg Ward
* [essential] make it possible for largefiles to seamlessly use repos
where the standins live in .hgbfiles, so that existing users of
bfiles don't have to go through a painful conversion process
I will continue to argue that idea is misguided.

The main point, as we've discussed at length previously, is that it results in bad behavior [1]. To recap: when you push, largefiles looks for largefiles in the changesets you're pushing, asks the server whether it has them already, and uploads any that the server doesn't already have. It doesn't do any checking for changesets it's not pushing, since, if you're only ever using largefiles, changesets already on the server *must* have their corresponding largefiles. For the common case I'm envisioning clients doing here--simply setting the largefiles prefix to .bfiles--this precondition is invalid, and you'll trivially be able to make repositories that no one can actually clone.
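
To make the recap concrete, here is a toy model of that push logic in Python. It is only a sketch: the changeset dictionaries, the `server_store` set and the function name are hypothetical, not the extension's actual data structures.

```python
def largefiles_to_upload(outgoing_changesets, server_store):
    """Return hashes referenced by outgoing changesets that the server lacks."""
    wanted = set()
    for cset in outgoing_changesets:
        wanted.update(cset["largefile_hashes"])
    # Only outgoing changesets are inspected; anything already on the
    # server is assumed to have its largefiles.
    return wanted - server_store


# The server already holds changeset 0, whose standin was created by bfiles,
# so its largefile "aaa" was never uploaded to this store.
server_changesets = [{"rev": 0, "largefile_hashes": {"aaa"}}]
server_store = set()

outgoing = [{"rev": 1, "largefile_hashes": {"bbb"}}]
uploads = largefiles_to_upload(outgoing, server_store)
server_store |= uploads

# What checking out every revision would need versus what the server can serve.
needed = set()
for cset in server_changesets + outgoing:
    needed |= cset["largefile_hashes"]
missing = needed - server_store
print(sorted(uploads), sorted(missing))
```

The push looks successful because everything it checked was uploaded, yet the largefile for the older changeset is still absent from the server, which is the broken precondition described above.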

Especially given that point, I remain unclear why we'd allow changing the largefiles prefix when we don't allow changing the name of .hg/.hgignore/.hgeol and so on. If, for your particular setup, you *know* that the largefiles preconditions are actually satisfied, and you really don't want to reconvert, it's trivial to add a piggyback extension that swizzles out the standin prefix. That's what we may do for grandfathering in .kbf standins for existing Kiln clients, for example. But I really don't believe that behavior should be in core Mercurial.

largefiles is based on bfiles, and owes it a tremendous amount of debt, but they're simply not interchangeable, and I think trying to force them to be is a bad idea.

--Benjamin

[1]: http://selenic.com/pipermail/mercurial-devel/2011-August/033751.html
Greg Ward
2011-10-02 02:35:40 UTC
Permalink
Post by Benjamin Pollack
 * [essential] make it possible for largefiles to seamlessly use repos
   where the standins live in .hgbfiles, so that existing users of
   bfiles don't have to go through a painful conversion process
I will continue to argue that idea is misguided.
The main point, as we've discussed at length previously, is that it results in bad behavior [1].  To recap: when you push, largefiles looks for largefiles in the changesets you're pushing, asks the server whether it has them already, and uploads any that the server doesn't already have.  It doesn't do any checking for changesets it's not pushing, since, if you're only ever using largefiles, changesets already on the server *must* have their corresponding largefiles.  For the common case I'm envisioning clients doing here--simply setting the largefiles prefix to .bfiles--this precondition is invalid, and you'll trivially be able to make repositories that no one can actually clone.
But I don't want to *configure* the standin dir. That was my original
proposal, but then Matt said "why not autodetect?" and I immediately
slapped my forehead and said "duh, yes, of course". Can't remember if
I said so publicly.

Here's the behaviour I want:

    if '.hglf' is a dir:
        use that as the largefiles standin prefix
    elif '.hgbfiles' is a dir:
        use that as the largefiles standin prefix
    elif '.kbf' is a dir:
        use that as the largefiles standin prefix
    else:
        no largefiles in this changeset: fall back to '.hglf' in case
        someone adds one
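
The detection order above could be sketched in Python roughly as follows. This is a minimal illustration, not code from largefiles: `detect_standin_prefix` and `STANDIN_PREFIXES` are hypothetical names, and the sketch inspects directories in the working copy, whereas a real implementation might need to look at the manifest of the changeset instead.

```python
import os
import tempfile

# Checked in order; '.hglf' is the canonical prefix, the others are legacy.
STANDIN_PREFIXES = (".hglf", ".hgbfiles", ".kbf")


def detect_standin_prefix(root, default=".hglf"):
    """Return the first known standin directory present under root."""
    for prefix in STANDIN_PREFIXES:
        if os.path.isdir(os.path.join(root, prefix)):
            return prefix
    # No standins yet: use the canonical prefix for any that get added.
    return default


with tempfile.TemporaryDirectory() as root:
    empty = detect_standin_prefix(root)
    os.mkdir(os.path.join(root, ".hgbfiles"))
    legacy = detect_standin_prefix(root)
    os.mkdir(os.path.join(root, ".hglf"))
    both = detect_standin_prefix(root)
```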

Obviously that has to be done pretty early in setting up the
extension, but it can't be a constant anymore. I haven't looked at the
code enough to know if this is a straightforward refactoring or an
"OMG, don't go there" thing.
Post by Benjamin Pollack
Especially given that point, I remain unclear why we'd allow changing the largefiles prefix when we don't allow changing the name of .hg/.hgignore/.hgeol and so on.  If, for your particular setup, you *know* that the largefiles preconditions are actually satisfied, and you really don't want to reconvert, it's trivial to add a piggyback extension that swizzles out the standin prefix.  That's what we may do for grandfathering in .kbf standins for existing Kiln clients, for example.  But I really don't believe that behavior should be in core Mercurial.
There's no way in hell I'm going to convince my manager and 40
developers that we need to convert our entire repository (120k
changesets, with stuff in .hgbfiles/ dating back to 2002 -- this was
converted from CVS) just so I can stop maintaining bfiles and switch
us to largefiles. Who's going to update all the changeset ID
references in our Bugzilla comments? Or our build database? *shudder*
The only way that's gonna happen is if I can make it a seamless
transition.

Repeat that gripe for all other users of bfiles out there. I don't
know how many there are, but there's more than just one of us!

However, a silly little piggyback extension might be a tolerable
compromise. I'll keep it in my back pocket. Now... what to call it...
I know! How about "bfiles"? I'd love to see that monster shrink down
to a 5-line monkeypatch.

Greg
Matt Mackall
2011-10-02 19:40:23 UTC
Permalink
Post by Greg Ward
    if '.hglf' is a dir:
        use that as the largefiles standin prefix
    elif '.hgbfiles' is a dir:
        use that as the largefiles standin prefix
    elif '.kbf' is a dir:
        use that as the largefiles standin prefix
    else:
        no largefiles in this changeset: fall back to '.hglf' in case
        someone adds one
I tend to agree with this approach. Largefiles should be
backwards-compatible with the installed base if possible.
--
Mathematics is the supreme nostalgia of our time.
Benjamin Pollack
2011-10-05 22:03:51 UTC
Permalink
Post by Matt Mackall
Post by Greg Ward
    if '.hglf' is a dir:
        use that as the largefiles standin prefix
    elif '.hgbfiles' is a dir:
        use that as the largefiles standin prefix
    elif '.kbf' is a dir:
        use that as the largefiles standin prefix
    else:
        no largefiles in this changeset: fall back to '.hglf' in case
        someone adds one
I tend to agree with this approach. Largefiles should be
backwards-compatible with the installed base if possible.
All right. I'll submit a patch with this logic.

--Benjamin
