Excessive disk use in Rosetta download and build processes

    • #760
      Anonymous

        The software I build and install gets pushed to standalone installations on about 100 different machines. Each installation usually lives on higher-cost storage (e.g. RAID or a SAN), and all those installations are then backed up somewhere, so I usually take the time to weed out as many unnecessary files as possible. It makes everyone’s life a little better.

        Looking at the Rosetta packaging, there are a number of opportunities for improvements. The first one is very easy. Here’s an expanded rosetta3.2_bundles.tgz:

        $ du -k --max-depth=1 | sort -n
        48 ./.svn
        372 ./BioTools
        1896 ./manual
        70724 ./rosetta_demos
        90188 ./foldit
        556332 ./rosetta_source
        738672 ./rosetta_database
        1406972 ./rosetta_fragments
        2865216 .
        $ find . -type d -name '.svn' | xargs rm -rf
        $ du -k --max-depth=1 | sort -n
        148 ./BioTools
        1692 ./manual
        34376 ./rosetta_demos
        90192 ./foldit
        269804 ./rosetta_source
        367252 ./rosetta_database
        703292 ./rosetta_fragments
        1466768 .

        So roughly half of the disk usage in your tarball (about 49% by the numbers above) goes to files that the end user will never use. Those files also cost bandwidth on every download, for both you and your users. Is there some reason you’re leaving them in the download?
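The `find ... | xargs rm -rf` pipeline above did the job on this tree, but it would split on any path containing whitespace. A slightly more defensive sketch follows, wrapped in a hypothetical helper called `remove_svn_dirs` (the name is only an illustration, not part of Rosetta):

```shell
# Hypothetical helper: remove every .svn directory below a root directory
# (defaults to the current directory when no argument is given).
# -prune stops find from descending into a directory it is about to delete;
# -exec ... {} + batches the paths and copes with spaces in names.
remove_svn_dirs() {
    find "${1:-.}" -type d -name '.svn' -prune -exec rm -rf {} +
}
```

Calling `remove_svn_dirs .` from the top of the unpacked bundle should have the same effect as the pipeline shown earlier.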

        I built a static version of Rosetta using the Intel Compilers for linux. Yes, I know a dynamic version would use less disk space, but I support many different linux distributions. I used this command line:

        ./scons.py bin mode=release cxx=icc extras=static

        Which works fine to create static binaries with the Intel compilers, but they don’t have their debug symbols stripped:

        $ file minirosetta.linuxiccrelease
        minirosetta.linuxiccrelease: ELF 32-bit LSB executable, Intel 80386, version 1 (SYSV), for GNU/Linux 2.2.5, statically linked, for GNU/Linux 2.2.5, not stripped

        Is the mode=release target not working for the Intel compilers? Why are the debug symbols still there?

        Using extras=static also creates an identical binary with the word “static” in its name:

        $ md5sum minirosetta.*
        39c344f6a6b45f1280a3fc7fe80d5c70 minirosetta.linuxiccrelease
        39c344f6a6b45f1280a3fc7fe80d5c70 minirosetta.static.linuxiccrelease

        That seems superfluous since the path to the binaries includes the word static. Actually the way that every binary is suffixed with the operating system, compiler and build target seems superfluous given that information is all contained in the path to the binaries. It also invalidates all your documentation from the perspective of non-technical users trying to cut and paste example commands from the manual.

        From a disk use perspective:

        $ du -sk rosetta_source/build/src/release/linux/2.6/32/x86/icc/static/
        4336784 rosetta_source/build/src/release/linux/2.6/32/x86/icc/static/
        $ rm rosetta_source/build/src/release/linux/2.6/32/x86/icc/static/*.static.*
        $ strip rosetta_source/build/src/release/linux/2.6/32/x86/icc/static/*.linuxiccrelease
        $ du -sk rosetta_source/build/src/release/linux/2.6/32/x86/icc/static/
        2016292 rosetta_source/build/src/release/linux/2.6/32/x86/icc/static/

        So that is a disk saving of more than 50% (about 53% by the numbers above) from a few simple steps.
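The `rm ...*.static.*` step above assumes every `*.static.*` file really is a duplicate. A more cautious sketch is below; the helper name `remove_duplicate_static` is hypothetical, and it deletes a `.static.` copy only when a byte-identical twin without that infix sits in the same directory:

```shell
# Hypothetical helper: delete NAME.static.SUFFIX only if NAME.SUFFIX exists
# in the same directory and has identical contents.
remove_duplicate_static() {
    dir=$1
    for dup in "$dir"/*.static.*; do
        [ -f "$dup" ] || continue            # skip when the glob matched nothing
        base=${dup##*/}                      # file name without the directory part
        twin="$dir/${base%%.static.*}.${base#*.static.}"
        if [ -f "$twin" ] && cmp -s "$dup" "$twin"; then
            rm -f "$dup"
        fi
    done
}
```

After something like `remove_duplicate_static rosetta_source/build/src/release/linux/2.6/32/x86/icc/static`, the remaining binaries can be stripped exactly as shown above.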

      • #4913
        Anonymous

          None of the developer labs are limited by disk space, so it’s not really a consideration for us. If disk space and bandwidth became restrictively expensive, we’d optimize for it. As it is, developer time is vastly more expensive than anything else, so whatever makes development easier is what gets done. (Also, none of us use the release for anything, we use the development version.)

          A) We haven’t included the subversion files in the past. Assuming we included them on purpose this time, it’s probably so users can revert changes locally instead of redownloading the code. It sure is convenient to SVN revert mistakes… (Um, it’s also possible it was just an oversight, I don’t know, I don’t do the release.)

          B) Probably 50% or more of the stuff in the rosetta_source directory is junk beyond the svn files you removed. There’s a huge amount of testing data integrated into the codebase. Some of us have been trying to get it moved out but there’s institutional inertia. Where are you such that you are administrating Rosetta across multiple platforms? Let me know if you want help removing some of that extra data to slim your Rosetta installs down and I’ll be happy to point it out (big hint: everything in the test directory can probably go if you are managing it for the end users…)

          C) I don’t know enough about SCons to address it but I’ll point out your concern to someone who does.

        • #4914
          Anonymous

            Also, for the debug symbols in the intel/static/release build: yes, you’re probably right, SCons is probably screwed up. We’re SCons clients, not developers, so our use of it is certainly subtly wrong in many ways. If you can fix this I’ll port the fix back to trunk.

            You may be able to fix it by comparing the gcc/static/release to the intel/static/release options in tools/build/basic.settings.

          • #4916
            Anonymous

              Hi bene,

              Thanks for your careful investigation of the size of Rosetta.

              It was an oversight not to strip out the .svn directories, which we’ll take care of.

              As for building two copies of the executables, this is a known bug that should be fixed soon. For now feel free to delete the extra copies.

              As for stripping the symbols, we keep them in so users can get backtraces in the debugger should they run into a problem and want to try to figure out what is going on. If you are concerned about the executable size, the best thing to do would be to call ‘strip’ on each executable.

              Best of luck,
              Matt

            • #4934
              Anonymous

                Thanks for your comments and your offer of help, smlewis.

                • #4947
                  Anonymous

                    Here’s a list of things that aren’t required across all installations (you should keep them on the master installation). Deleting these might pare it down by half; it won’t affect the size of the compiled code (which is waaaaay bigger than the uncompiled code anyway…)

                    foldit
                    manual (consider your users, might want to keep)
                    rosetta_demos (consider your users, might want to keep)
                    rosetta_fragments

                    in rosetta_source:
                    test (310 MB! more than src! ugh!)
                    stubs
                    analysis

                    in rosetta_source/src:
                    rdwizard
                    integration
                    python (probably)

                    in rosetta_source/src/apps/
                    benchmark
                    curated
                    pilot
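The list above could be applied to a deployed copy with a sketch like the following. The helper name `prune_rosetta_copy` is hypothetical, the paths are taken from the list as written, and `manual`, `rosetta_demos` and `rosetta_source/src/python` are left alone because the list marks them as uncertain. It is meant for the pushed-out copies only, not the master installation:

```shell
# Hypothetical helper: remove the directories listed above from ONE deployed
# copy of the bundle. Takes the top-level bundle directory as its argument.
prune_rosetta_copy() {
    root=$1
    # Refuse to run without a root, so we never build paths like "/foldit".
    [ -n "$root" ] || return 1
    for rel in \
        foldit \
        rosetta_fragments \
        rosetta_source/test \
        rosetta_source/stubs \
        rosetta_source/analysis \
        rosetta_source/src/rdwizard \
        rosetta_source/src/integration \
        rosetta_source/src/apps/benchmark \
        rosetta_source/src/apps/curated \
        rosetta_source/src/apps/pilot
    do
        if [ -d "$root/$rel" ]; then
            echo "removing $root/$rel"
            rm -rf "$root/$rel"
        fi
    done
}
```

Running `du -k --max-depth=1` before and after on a scratch copy is a cheap way to confirm how much this actually saves before rolling it out.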

                • #5284
                  Anonymous

                    just a me too for reducing the release footprint of rosetta.

                  • #5296
                    Anonymous

                      The svn files were removed for 3.2.1, just so everybody knows.
