Hi Émanuel,
On Sun, May 8, 2016 at 4:40 PM, Émanuel Barry wrote:
Take each X commit (say, every 100th or 1000th commit, or even every commit if we decide to be insane^Wprecise), store hashes of all files at that revision with possibly the file tree, in a .py file as a list or dict, or json or anything you prefer. Then I upload it for you to look at and you can compare with the mercurial repo. Or we run the same script on the mercurial repo and compare the resulting files.
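To make that concrete, here is a minimal sketch of such a per-revision hash file. It assumes the revision has already been checked out into a plain directory; tree_manifest and manifest_to_json are hypothetical names, not part of any existing tool, and SHA-256 is just one possible choice of hash.

```python
import hashlib
import json
from pathlib import Path


def tree_manifest(root, skip_dirs=(".git", ".hg")):
    """Map each file's path (relative to root) to the SHA-256 of its bytes.

    Assumption: `root` is a checkout of a single revision. The VCS metadata
    directories are skipped so that only tracked content is compared.
    """
    root = Path(root)
    manifest = {}
    for path in sorted(root.rglob("*")):
        rel = path.relative_to(root)
        if any(part in skip_dirs for part in rel.parts):
            continue
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            manifest[rel.as_posix()] = digest
    return manifest


def manifest_to_json(manifest):
    """Serialize with sorted keys so two equal manifests give identical text."""
    return json.dumps(manifest, sort_keys=True, indent=1)
```

Running the same function over a checkout from each repo at the matching revision should give two dicts that compare equal if the file contents agree.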
If we store anything externally, that could start limiting us. I looked at the problem from this angle: the final cpython git repo has ~10000 commits in the master branch. That's not a large number to deal with, and the original hg repo should have exactly the same number of commits.

We have to diff each of these commits, including merge commits, against its counterpart in the other repo and check that the contents are the same. If we encounter anything where the git repo differs in content or history from the hg repo, we alert and fail. Since this is a history-checking operation, we could complete it in O(minutes), or ~1 hour, to validate the repos.

This will give us confidence in the migration, and will help us evaluate multiple hg -> git repos that have been migrated at different points in time. This feature will go in this tool: https://github.com/orsenthil/cpython-hg-to-git , which we will use to migrate, sync, and validate hg->git repos. If interested, you could research an efficient way to do the above operation and submit a pull request against that tool.

HTH,
Senthil
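P.S. A rough sketch of the commit-by-commit check, to show the alert-and-fail shape. It is only an illustration under assumptions, not what the tool does today: compare_histories and HistoryMismatch are hypothetical names, each history is assumed to be an iterable of (commit_id, {path: content_hash}) pairs in the same order (oldest first), and it compares per-commit content hashes rather than running an actual diff. How those pairs are extracted from git and hg is left out.

```python
from itertools import zip_longest


class HistoryMismatch(Exception):
    """Raised at the first commit where the two histories disagree."""


def compare_histories(git_history, hg_history):
    """Walk both histories in lockstep and fail on the first divergence.

    Returns the number of commits verified when everything matches.
    """
    checked = 0
    pairs = zip_longest(git_history, hg_history)
    for index, (git_entry, hg_entry) in enumerate(pairs):
        # One side ran out first: the commit counts are not the same.
        if git_entry is None or hg_entry is None:
            raise HistoryMismatch(
                f"commit count differs after {index} matching commits"
            )
        git_id, git_files = git_entry
        hg_id, hg_files = hg_entry
        if git_files != hg_files:
            # Report paths that are missing on one side or hash differently.
            differing = sorted(
                path
                for path in git_files.keys() | hg_files.keys()
                if git_files.get(path) != hg_files.get(path)
            )
            raise HistoryMismatch(
                f"commit #{index} (git {git_id}, hg {hg_id}) "
                f"differs in: {differing[:5]}"
            )
        checked += 1
    return checked
```

Because both inputs can be generators, nothing has to be stored externally: each pair of commits is compared and then dropped.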