Hey there fellow core-workflowers; I've been following the GitHub transition for a while now, and have some questions (which may have been answered already, I apologize in advance if so!). So, I read the PEP again to see if it was answered, seems it wasn't clear (or I'm visually impaired, you get to choose).
I understand that there's already a semi-official mirror of the cpython repo on GitHub, and I've been wondering why it isn't enough for our needs. Sure, a bunch of stuff needs to be done (like the CLA bot and the PR <-> issue linking), but surely they could be done on the current mirror. My workflow uses the GitHub mirror, and my patches are compatible with b.p.o and Rietveld. Is there something I'm missing as to why we can't re-use this one?
Then, as someone who's been using git and GitHub for almost everything code-related and never touched Mercurial, is there something I can do to help with the transition? I would really love a more accessible workflow for both core developers and external contributors as soon as possible (don't we all?), so if I can help I'd love to; I do realize I'm a bit late to the party though.
Keep up the good work!
-Emanuel
On Sun, May 8, 2016 at 4:12 PM, Émanuel Barry
I understand that there's already a semi-official mirror of the cpython repo on GitHub, and I've been wondering why it isn't enough for our needs.
It is suitable for our needs. Our last discussion was about how to ascertain that the cpython git repo has the same history as the hg repo, so that after the migration we do not lose any information from the old system.
This could be done by:
* checking the number of commits in both repos for each branch
* checking the hashes of the source files in the two repos
* (And do we go about validating each piece of the commit log graph too?)
If you have any suggestions, since you are using the cpython git mirror, please feel free to share your thoughts.
Welcome to the party!
Thanks, Senthil
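As a rough illustration of the first check listed above (commit counts per branch), here is a minimal Python sketch. All function names are hypothetical and not part of any existing tool. It assumes `git` and `hg` are on the PATH, that `git rev-list --count` and the hg revset `ancestors(...)` count comparable sets of commits, and that the caller decides which git branch corresponds to which hg branch (for example git `master` versus hg `default`).

```python
import subprocess


def count_git_commits(repo, branch):
    """Count commits reachable from `branch` in a git clone at `repo`."""
    out = subprocess.check_output(
        ["git", "rev-list", "--count", branch], cwd=repo, text=True
    )
    return int(out.strip())


def count_hg_commits(repo, branch):
    """Count changesets that are ancestors of `branch` in an hg clone.

    Assumption: ancestors of the branch head are the right counterpart
    to what `git rev-list` counts.
    """
    out = subprocess.check_output(
        ["hg", "log", "-r", 'ancestors("{}")'.format(branch),
         "--template", "{node}\n"],
        cwd=repo, text=True,
    )
    return len(out.splitlines())


def count_mismatches(git_counts, hg_counts):
    """Return {branch: (git_count, hg_count)} for every branch that differs.

    A branch missing from one side is reported with None for that side.
    """
    branches = sorted(set(git_counts) | set(hg_counts))
    return {
        b: (git_counts.get(b), hg_counts.get(b))
        for b in branches
        if git_counts.get(b) != hg_counts.get(b)
    }
```

Only `count_mismatches` avoids shelling out, so it can be exercised without either repository at hand. An empty result means the counts agree for every branch supplied.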
Why thank you! I probably missed that last discussion.
Do you need some help? I can probably generate a file with that information and pass it over so you can check it matches the Mercurial one. I’m not used to dealing with the log graph though, but I can probably manage something. Here’s what I have in mind, let me know if you have another/better idea:
Take each X commit (say, every 100th or 1000th commit, or even every commit if we decide to be insane^Wprecise), store hashes of all files at that revision with possibly the file tree, in a .py file as a list or dict, or json or anything you prefer. Then I upload it for you to look at and you can compare with the mercurial repo. Or we run the same script on the mercurial repo and compare the resulting files.
I can work on that this week, probably. Sounds like a good idea?
-Emanuel
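The proposal above (hash every file at every Nth commit and write the result to a file for comparison) could look roughly like the following Python sketch. The function names, the choice of SHA-256, and the JSON layout are assumptions made for illustration. Checking out each sampled commit before calling `hash_tree` is left to the caller, and files that exist for only one VCS may need to be excluded by hand.

```python
import hashlib
import json
import os


def hash_tree(root, skip_dirs=(".git", ".hg")):
    """Map each file's relative path under `root` to the SHA-256 of its bytes."""
    manifest = {}
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune VCS metadata directories in place so os.walk skips them.
        dirnames[:] = [d for d in dirnames if d not in skip_dirs]
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                # Symlinks are skipped; a real tool would need a policy here.
                continue
            rel = os.path.relpath(path, root).replace(os.sep, "/")
            with open(path, "rb") as f:
                manifest[rel] = hashlib.sha256(f.read()).hexdigest()
    return manifest


def every_nth(commits, n):
    """Pick every n-th commit from an ordered list, starting with the first."""
    return commits[::n]


def dump_manifest(manifest, path):
    """Write a manifest as JSON with sorted keys so two dumps diff cleanly."""
    with open(path, "w") as f:
        json.dump(manifest, f, indent=1, sort_keys=True)
```

Running the same functions against a git checkout and an hg checkout of the corresponding revision should produce identical manifests if the contents match.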
From: Senthil Kumaran [mailto:senthil@uthcode.com]
Sent: Sunday, May 08, 2016 7:29 PM
To: Émanuel Barry
Cc: core-workflow
Subject: Re: [core-workflow] Some questions
On Sun, May 8, 2016 at 4:12 PM, Émanuel Barry
Hi Émanuel,
On Sun, May 8, 2016 at 4:40 PM, Émanuel Barry
Take each X commit (say, every 100th or 1000th commit, or even every commit if we decide to be insane^Wprecise), store hashes of all files at that revision with possibly the file tree, in a .py file as a list or dict, or json or anything you prefer. Then I upload it for you to look at and you can compare with the mercurial repo. Or we run the same script on the mercurial repo and compare the resulting files.
If we store anything externally, that could start limiting us. I looked at the problem from this angle: the final cpython git repo has ~10000 commits in the master branch. That's not a large number to deal with. The original hg repo should have exactly the same number of commits. We have to do a diff between each of these commits, including merge commits, and check whether the contents of those commits are the same. If we encounter anything where the git repo differs in content or history from the hg repo, we alert and fail.
Since this is a history-checking operation, we could complete it in O(minutes), or ~1 hour, to validate the repos. This will give us confidence in the migration, and will help us evaluate multiple hg -> git repos that have been migrated at different points in time.
This feature will go in this tool: https://github.com/orsenthil/cpython-hg-to-git, which we will use to migrate, sync, and validate hg->git repos. If interested, you could research an efficient way to do the above operation and submit a pull request against that tool.
HTH, Senthil
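The "compare every commit, alert and fail on the first difference" idea described above can be sketched in Python as follows. This is not code from the cpython-hg-to-git tool. The function name and the shape of the inputs (ordered `(commit_id, manifest)` pairs, where a manifest maps file path to content hash) are assumptions made for illustration.

```python
from itertools import zip_longest


def first_divergence(git_history, hg_history):
    """Walk two histories in lockstep and report the first mismatch.

    Each history is an ordered iterable of (commit_id, manifest) pairs.
    Returns None when they agree, otherwise a tuple
    (index, git_id, hg_id, differing_paths). When one history is longer,
    the missing side's id is None and differing_paths is empty.
    """
    for index, (g, h) in enumerate(zip_longest(git_history, hg_history)):
        if g is None or h is None:
            return index, (g[0] if g else None), (h[0] if h else None), []
        (git_id, git_files), (hg_id, hg_files) = g, h
        if git_files != hg_files:
            paths = sorted(
                p for p in set(git_files) | set(hg_files)
                if git_files.get(p) != hg_files.get(p)
            )
            return index, git_id, hg_id, paths
    return None
```

Because the inputs can be generators, nothing needs to be stored externally. Each pair of manifests can be produced, compared, and discarded as the walk proceeds.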
On 05/08/2016 05:43 PM, Senthil Kumaran wrote:
On Sun, May 8, 2016 at 4:40 PM, Émanuel Barry wrote:
Take each X commit (say, every 100th or 1000th commit, or even every commit if we decide to be insane^Wprecise), store hashes of all files at that revision with possibly the file tree, in a .py file as a list or dict, or json or anything you prefer. Then I upload it for you to look at and you can compare with the mercurial repo. Or we run the same script on the mercurial repo and compare the resulting files.
If we store anything externally, that could start limiting us.
I read that as generating a temp file from each tool (git and hg) and then comparing them -- not as storing those files. (I could be wrong, though.) -- ~Ethan~
(I apologize for top-posting, I still haven’t figured out how to fix my email client)
There are nearly 94k commits in the git repo, and I expect the hg repo has that same number. It’s a tad more than 10,000.
I’ll definitely take a look at that tool; my main weakness is that I don’t know hg commands or similar, but comparing separate commits is most definitely better.
@Ethan: I meant that I would write all the output to a file for comparison, but apparently that’s not a very good idea, so here I drop it instead.
I’ll look at the tool and see what I can do. I’ll try to document my findings if I can’t come up with a good solution, and probably even if I do.
Cheers,
-Emanuel
From: Senthil Kumaran [mailto:senthil@uthcode.com]
Sent: Sunday, May 08, 2016 8:43 PM
To: Émanuel Barry
Cc: core-workflow
Subject: Re: [core-workflow] Some questions
Hi Émanuel,
On Sun, May 8, 2016 at 4:40 PM, Émanuel Barry
Hi!
On Sun, May 08, 2016 at 07:40:15PM -0400, Émanuel Barry
Take each X commit (say, every 100th or 1000th commit, or even every commit if we decide to be insane^Wprecise), store hashes of all files at that revision with possibly the file tree, in a .py file as a list or dict, or json or anything you prefer. Then I upload it for you to look at and you can compare with the mercurial repo. Or we run the same script on the mercurial repo and compare the resulting files.
IMO the tool can be designed like this:
1. Generate the list of commits in a branch::
git log -m --first-parent --format='%H'
2. For every commit in the list generate the list of files in the commit::
git cat-file -p $SHA1^{tree}
This produces a list in the format like this:
040000 tree e0fd616e5707b006f1a2df8be85d0be973192ee0 Doc
040000 tree 33e09fe5cdcd421c989de911c97fd1d901ac0e8e Grammar
040000 tree 39ca3d725f190d61aa45ea1c8bf4802f44f52e47 Include
100644 blob 84a3337c2e5289fb8e50e5ef6d8ac2ac78be70b2 LICENSE
040000 tree 08eeead22b72c75d84624509286e6c54ec6656ec Lib
040000 tree b1a2357d3d461d161d92d73aabb74f0a9ab52294 Mac
100644 blob 2a687e58c9141b44520db9ad0b07b71525fd051d Makefile.pre.in
For every blob in the list store its hash. For every tree use
git cat-file -p $SHA1^{tree}
recursively.
I don't have any idea how to do that in Mercurial, though.
-Emanuel
Oleg. -- Oleg Broytman http://phdru.name/ phd@phdru.name Programmers don't die, they just GOSUB without RETURN.
On Mon, May 09, 2016 at 10:32:59AM +0200, Oleg Broytman
Hi!
On Sun, May 08, 2016 at 07:40:15PM -0400, Émanuel Barry
wrote: Take each X commit (say, every 100th or 1000th commit, or even every commit if we decide to be insane^Wprecise), store hashes of all files at that revision with possibly the file tree, in a .py file as a list or dict, or json or anything you prefer. Then I upload it for you to look at and you can compare with the mercurial repo. Or we run the same script on the mercurial repo and compare the resulting files.
IMO the tool can be designed like this:
1. Generate the list of commits in a branch::
git log -m --first-parent --format='%H'
2. For every commit in the list generate the list of files in the commit::
git cat-file -p $SHA1^{tree}
This produces a list in the format like this:
040000 tree e0fd616e5707b006f1a2df8be85d0be973192ee0 Doc
040000 tree 33e09fe5cdcd421c989de911c97fd1d901ac0e8e Grammar
040000 tree 39ca3d725f190d61aa45ea1c8bf4802f44f52e47 Include
100644 blob 84a3337c2e5289fb8e50e5ef6d8ac2ac78be70b2 LICENSE
040000 tree 08eeead22b72c75d84624509286e6c54ec6656ec Lib
040000 tree b1a2357d3d461d161d92d73aabb74f0a9ab52294 Mac
100644 blob 2a687e58c9141b44520db9ad0b07b71525fd051d Makefile.pre.in
For every blob in the list store its hash. For every tree use
git cat-file -p $SHA1^{tree}
Oops, my mistake::
git cat-file -p $TREE-SHA1
recursively.
I don't have any idea how to do that in Mercurial, though.
-Emanuel
Oleg. -- Oleg Broytman http://phdru.name/ phd@phdru.name Programmers don't die, they just GOSUB without RETURN.
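The recursive walk described in the two messages above (start from `$SHA1^{tree}` for a commit, then call `git cat-file -p` on each subtree's own SHA) might be sketched in Python like this. The function names are hypothetical. The parser assumes one `<mode> <type> <sha> <name>` entry per line, as in the sample listing, and `git_blob_sha1` is included on the assumption that file contents obtained from Mercurial would be hashed the same way git hashes blobs, so the two sides can be compared.

```python
import hashlib
import subprocess


def parse_tree_listing(text):
    """Parse `git cat-file -p <tree>` output into (mode, type, sha, name) tuples."""
    entries = []
    for line in text.splitlines():
        if not line.strip():
            continue
        # "<mode> <type> <sha>" followed by whitespace and the entry name.
        mode, kind, rest = line.split(" ", 2)
        sha, name = rest.split(None, 1)
        entries.append((mode, kind, sha, name))
    return entries


def cat_tree(repo, sha):
    """Return the listing for a tree SHA, or for "<commit>^{tree}"."""
    return subprocess.check_output(
        ["git", "cat-file", "-p", sha], cwd=repo, text=True
    )


def blob_hashes(sha, read_tree, prefix=""):
    """Recursively collect {path: blob_sha} for the tree object `sha`.

    `read_tree` is any callable that returns the listing text for a tree.
    Entries that are neither blobs nor trees are ignored.
    """
    result = {}
    for mode, kind, child, name in parse_tree_listing(read_tree(sha)):
        path = prefix + name
        if kind == "tree":
            result.update(blob_hashes(child, read_tree, path + "/"))
        elif kind == "blob":
            result[path] = child
    return result


def git_blob_sha1(data):
    """Compute the SHA-1 git assigns to a blob with the given bytes."""
    return hashlib.sha1(b"blob %d\0" % len(data) + data).hexdigest()
```

A call such as `blob_hashes(commit + "^{tree}", lambda s: cat_tree(repo, s))` would walk one commit. `git ls-tree -r <commit>` performs the same recursion in a single command, which may be simpler where it is available.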
On Sun, 8 May 2016 at 16:29 Senthil Kumaran
On Sun, May 8, 2016 at 4:12 PM, Émanuel Barry
wrote: I understand that there's already a semi-official mirror of the cpython repo on GitHub, and I've been wondering why it isn't enough for our needs.
It is suitable for our needs. Our last discussion was about how to ascertain that the cpython git repo has the same history as the hg repo, so that after the migration we do not lose any information from the old system.
Right, we *hope* the mirror is good enough, but when Eli created it he didn't worry too much about accuracy so we need to evaluate if it's good enough to simply switch to or if it needs to be thrown out. Hence, the discussion about how to ascertain if the mirror is acceptable.
This could be done using:
* check the number of commits in both repos for each branch
* checking the hash of the source files in two repos.
* (And do we go about validating each piece of commit log graph too)?
If you have any suggestions, since you are using the cpython git mirror, please feel free to share your thoughts.
We will also want a mapping of hg commits to git commits for https://hg.python.org/lookup which might help with the validation of the mirror.
Welcome to the party!
+1!
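One way the hg-to-git commit mapping mentioned above could be built is sketched below in Python. This is purely illustrative: the function name is hypothetical, and the matching key (some tuple of metadata expected to survive conversion unchanged, such as author, timestamp, and message) is an assumption, since the thread does not say how the mapping would be produced.

```python
from collections import defaultdict


def build_mapping(hg_commits, git_commits):
    """Map hg changeset ids to git commit ids through a shared key.

    Both arguments are iterables of (commit_id, key) pairs. A pair of
    commits is mapped only when its key is unique on both sides.
    Returns (mapping, unmatched_hg_ids).
    """
    hg_by_key, git_by_key = defaultdict(list), defaultdict(list)
    for commit_id, key in hg_commits:
        hg_by_key[key].append(commit_id)
    for commit_id, key in git_commits:
        git_by_key[key].append(commit_id)

    mapping, unmatched = {}, []
    for key, hg_ids in hg_by_key.items():
        git_ids = git_by_key.get(key, [])
        if len(hg_ids) == 1 and len(git_ids) == 1:
            mapping[hg_ids[0]] = git_ids[0]
        else:
            # Missing or ambiguous: leave for manual inspection.
            unmatched.extend(hg_ids)
    return mapping, unmatched
```

Any hg changeset left in the unmatched list would itself be a useful signal when validating the mirror.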
I just wanted to let you guys know that I haven’t been able to make any progress last week or this week on this, and it sucks because I really wanted to help on it, but a lot has happened in a short span of time and I don’t know when I’ll be able to work on anything for this. I’ll still try to free up some time to work on it, but it won’t be for a while, sadly. -Emanuel
No worries, thanks for the update!
On Fri, May 20, 2016, 21:39 Émanuel Barry
I just wanted to let you guys know that I haven’t been able to make any progress last week or this week on this, and it sucks because I really wanted to help on it, but a lot has happened in a short span of time and I don’t know when I’ll be able to work on anything for this. I’ll still try to free up some time to work on it, but it won’t be for a while, sadly.
-Emanuel
Hello,
I'm attaching a script that is an initial attempt at doing this for the git side of things. Everything is done in bash at the moment. It does make use of GNU parallel (which is the parallel package in both Ubuntu and Debian). I don't think anything else it uses isn't a standard tool in Linux (except git). Basically, to run it do the following:
1) clone cpython.git into the current directory (i.e. try not to have any "generated" files)
2) put scan.sh in the current directory and run it there
What it does is the following: it checks out every 1000th commit (can be changed) going backwards on the current branch and computes the md5sum for every file (except those found in .git) and puts the md5sums in a file in an outdir/ directory that it creates. The names of the files are $num-$commit where $num is the number of commits _backwards_ from the current commit (which makes sense if you think about iterating backwards from the current commit).
Running this on my laptop took ~11 minutes. I uploaded the output directory here in case you don't feel like running it:
http://thomasnyberg.com/outdir.tar.bz2
(Ignore the frontpage of my "website". I'm obviously not all that concerned by it...)
In any case, this might be helpful for others in addition to myself. I figured it was best to email the list before continuing (maybe this isn't really what's needed...). Possible things to add to this:
* doing something similar with comments
* doing the same thing on all branches
* maybe only compute the md5sum for changed files
* little thought has gone into efficiency...there may be obvious gains hiding
Of course something similar would have to be run with the hg version and then a comparison would need to be done. Hopefully this is helpful...
Cheers,
Thomas
On 05/08/2016 07:28 PM, Senthil Kumaran wrote:
On Sun, May 8, 2016 at 4:12 PM, Émanuel Barry
wrote: I understand that there's already a semi-official mirror of the cpython repo on GitHub, and I've been wondering why it isn't enough for our needs.
It is suitable for our needs. Our last discussion was about how to ascertain that the cpython git repo has the same history as the hg repo, so that after the migration we do not lose any information from the old system.
This could be done using:
* check the number of commits in both repos for each branch
* checking the hash of the source files in two repos.
* (And do we go about validating each piece of commit log graph too)?
If you have any suggestions, since you are using the cpython git mirror, please feel free to share your thoughts.
Welcome to the party!
Thanks, Senthil
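The comparison step Thomas mentions (run something similar on the hg side, then compare) might look like the Python sketch below. scan.sh itself is not shown in the thread, so several things here are assumptions: that each output file holds plain `md5sum`-style lines, that file names follow the `$num-$commit` pattern with a numeric `$num`, and that the same `$num` refers to the same point in history in both repositories. The function names are hypothetical.

```python
import os


def load_listing(path):
    """Parse md5sum-style lines ("<hash>  <file>") into {file: hash}."""
    listing = {}
    with open(path, encoding="utf-8", errors="replace") as f:
        for line in f:
            line = line.rstrip("\n")
            if line:
                digest, name = line.split(None, 1)
                # md5sum marks binary-mode entries with a leading "*".
                listing[name.lstrip("*")] = digest
    return listing


def index_by_num(outdir):
    """Map the leading $num of each "$num-$commit" file name to its path."""
    return {
        name.split("-", 1)[0]: os.path.join(outdir, name)
        for name in os.listdir(outdir)
    }


def differing_snapshots(git_outdir, hg_outdir):
    """Return the $num values whose listings differ or exist on one side only."""
    git_files = index_by_num(git_outdir)
    hg_files = index_by_num(hg_outdir)
    bad = []
    for num in set(git_files) | set(hg_files):
        if num not in git_files or num not in hg_files:
            bad.append(num)
        elif load_listing(git_files[num]) != load_listing(hg_files[num]):
            bad.append(num)
    return sorted(bad, key=int)
```

Snapshots are matched on `$num` rather than on the commit hash because git and hg assign different hashes to the same commit.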
_______________________________________________ core-workflow mailing list core-workflow@python.org https://mail.python.org/mailman/listinfo/core-workflow This list is governed by the PSF Code of Conduct: https://www.python.org/psf/codeofconduct
participants (6)
- Brett Cannon
- Ethan Furman
- Oleg Broytman
- Senthil Kumaran
- Thomas Nyberg
- Émanuel Barry