
run exports incorrectly applied for packages with builds with different exports #3617

Closed
jjhelmus opened this issue Jul 18, 2019 · 18 comments
Labels: locked [bot: locked due to inactivity], stale::closed [bot: closed after being marked as stale], stale [bot: marked as stale due to inactivity]

Comments

@jjhelmus
Contributor

jjhelmus commented Jul 18, 2019

Pulling this issue, reported by @minrk, out of #3610, since the original issue reported there has been addressed. Specifically, this covers the comments concerning run_exports with fftw.

If a channel contains packages with the same version number but different hashes or builds that express different run_exports, only one of these exports is captured in the channeldata.json file. The result is that the run_exports are incorrect for the package(s) that are not captured in channeldata.json. A slightly modified version of the example @minrk gave in #3610 shows this behavior:

meta.yaml

package:
  name: testing
  version: 1.0

requirements:
  host:
    - fftw 3.3.8 nompi_h7f3a6c3_1106
$ conda render --channel https://conda-static.anaconda.org/conda-forge .
--------------
Hash contents:
--------------
{}
----------
meta.yaml:
----------
package:
    name: testing
    version: '1.0'
requirements:
    host:
        - _libgcc_mutex 0.1 main
        - libgfortran-ng 7.3.0 hdf63c60_0
        - libgcc-ng 9.1.0 hdf63c60_0
        - fftw 3.3.8 nompi_h7f3a6c3_1106
    run:
        - fftw * mpi_mpich_*
extra:
    copy_test_source_files: true
    final: true

Notice that the mpi_mpich run_exports from channeldata.json are being applied even though the nompi package is in the host environment.
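A rough sketch of why this happens (illustrative Python only; the aggregation code, the nompi pin, and the truncated mpi_mpich build string are assumptions, not conda-build's actual code or the real channeldata.json schema): when run_exports are aggregated per package name and version, the last build indexed wins, and builds that differ only in build string collapse to a single entry.

# Illustrative only: aggregating run_exports keyed by (name, version).
packages = [
    {"name": "fftw", "version": "3.3.8", "build": "nompi_h7f3a6c3_1106",
     "run_exports": {"weak": ["fftw >=3.3.8,<4.0a0"]}},             # assumed nompi export
    {"name": "fftw", "version": "3.3.8", "build": "mpi_mpich_...",  # placeholder build string
     "run_exports": {"weak": ["fftw * mpi_mpich_*"]}},
]

channeldata_run_exports = {}
for pkg in packages:
    # Keyed by (name, version) only: the second fftw 3.3.8 build overwrites the first,
    # so the nompi build's run_exports are lost and every consumer sees the mpi_mpich export.
    channeldata_run_exports[(pkg["name"], pkg["version"])] = pkg["run_exports"]

print(channeldata_run_exports[("fftw", "3.3.8")])  # {'weak': ['fftw * mpi_mpich_*']}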

With conda-build 3.17.8 the rendering uses the run_exports information contained within the packages themselves:

$ conda render --channel https://conda-static.anaconda.org/conda-forge test/
--------------
Hash contents:
--------------
{}
----------
meta.yaml:
----------
package:
    name: testing
    version: '1.0'
requirements:
    host:
        - _libgcc_mutex 0.1 main
        - libgfortran-ng 7.3.0 hdf63c60_0
        - libgcc-ng 9.1.0 hdf63c60_0
        - fftw 3.3.8 nompi_h7f3a6c3_1106
extra:
    copy_test_source_files: true
    final: true
@jjhelmus
Contributor Author

This would also occur if packages in different subdirs (e.g. osx-64 and linux-64) had different run_exports, as channeldata.json only considers the version of the package, not the subdir.

@jjhelmus
Contributor Author

I see two options for fixing this.

  1. Always use the run_exports information from the packages themselves. This requires that all packages in the build and host environments be downloaded when a recipe is rendered, which can be slow.

  2. Expand the run_exports information in channeldata.json to include subdir, version, and build_string information (a possible shape is sketched at the end of this comment).

My preference would be the first option. Keeping run_exports information in two locations breaks the single responsibility principle and requires care to ensure that the two locations contain the same information. Using the information from the packages slows down rendering, but during a build the packages will be available (they are installed into the environment, after all), so it seems silly not to use the information they contain.

A compromise would be to add a configuration option that switches to using channeldata. This may be useful when the speed of rendering a recipe is more important than correctness.
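For option 2, one possible shape (purely hypothetical; this is not the current channeldata.json schema, and the pins and the truncated mpi_mpich build string are illustrative placeholders):

# Hypothetical expanded run_exports index, nested by subdir, version and build string
# so that different builds keep their own exports.
run_exports_index = {
    "fftw": {
        "linux-64": {
            "3.3.8": {
                "nompi_h7f3a6c3_1106": {"weak": ["fftw >=3.3.8,<4.0a0"]},  # assumed pin
                "mpi_mpich_...": {"weak": ["fftw * mpi_mpich_*"]},         # placeholder build string
            }
        }
    }
}

def lookup_run_exports(name, subdir, version, build_string):
    # Rendering would look up run_exports with the exact build resolved into the
    # host environment, rather than by package name and version alone.
    return run_exports_index.get(name, {}).get(subdir, {}).get(version, {}).get(build_string)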

@jjhelmus
Contributor Author

#3603 implements option 1

@msarahan
Contributor

I don't think it's a big deal to have run_exports in both packages and channeldata.json. Unlike index data, we don't hotfix this. I don't think subdir is necessary - the build strings or the hash_input.json data will likely include platform-specific differentiators, such as the compiler, that imply the platform.

That said, I'm OK with the first choice, but only if we can come up with a way of caching the graph so that rendering can still be fast. We could perhaps explore using libcfgraph or something. I'm not picky about any particular solution, but we have to scale up our builds. This change represents a roughly 20% speedup in rendering (variable with the number of packages involved and also whether or not said packages are cached). That's not much on a single build, but it adds up, especially for large trees like our proposed R buildout.

@mingwandroid
Contributor

What's the slow part? Once we're using .conda packages, extraction of this data should be fast. You need your dependencies when building anyway. OK, when only using conda render with no intention of building it'll be slower, but I don't think that's the common case?

Overall I don't mind which way we go.

@msarahan
Contributor

Reading one big JSON file is still miles faster than looping through many smaller ones. Rendering is core to our ability to do big things right now. If the need for rendering is obviated somehow (like caching a graph from post-build data), we can drop the channeldata scheme.

@jakirkham
Member

OK, when only using conda render with no intention of building it'll be slower, but I don't think that's the common case?

This may come up when conda-smithy re-renders a feedstock, so yes, it could be common.

@mariusvniekerk
Contributor

Frankly, conda-smithy rerenders aren't that bad atm (mostly since they tend to be fairly tightly scoped). Being correct is FAR more important imo. libcfgraph is basically just post-package-upload metadata extraction, which should be easier to do with .conda.

@dhirschfeld
Contributor

Being correct is FAR more important imo

+:100: I can't agree with this enough!

If there is ever any tension between performance and correctness, correctness has to win.

If conda doesn't do the right thing it is literally unusable as it will create broken environments.

@jjhelmus
Contributor Author

#3621 drops the default behavior back to downloading packages to determine run_exports, while allowing opt-in use of channeldata via the --use-channeldata command line argument. Work is still needed to support variants in channeldata.json, but the default behavior would correctly determine run_exports.
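With that change the channeldata shortcut becomes opt-in; the invocation would look something like this (assuming the flag is exposed on conda render as described, reusing the channel from the repro above):

$ conda render --channel https://conda-static.anaconda.org/conda-forge --use-channeldata .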

@mbargull
Member

Having the build-related metadata separately available outside of the package archives surely makes sense, conceptually and performance-wise.
But if it's inconsistent with the source data, we really shouldn't use it until it is consistent.

Meaning, if gh-3603 restores correctness, then please merge that PR.
When channeldata.json gets (more?) consistent with the actual metadata, a configuration option might make sense. Until then, since we know something is not right, such an option would be equivalent to --maybe-do-incorrect-stuff, i.e. it's not of much use...

N.B.: I don't know what channeldata.json contains or how it's used. (Is its use for run_exports new in 3.18? I'm a little out of the loop...)


  2. Expand the run_exports information in channeldata.json to include subdir, version, and build_string information.

I don't think subdir is necessary - the build strings or the hash_input.json data will likely include platform-specific differentiators, such as the compiler, that imply the platform.

I'm confused. How would you be able to use the run_exports information correctly if you don't have the full build information (channel, subdir, version, (build number,) build string) available? Unless you impose restrictions on how/when the packager is able to apply/change run_exports in a build's meta.yaml, this can't work. (Imposing the necessary restrictions -- whatever those may be -- would be a breaking change, of course.)
If channeldata.json is agnostic to subdirs, build strings and such, it cannot be a valid input for figuring out run_exports. You simply need a repodata.json-like file that is augmented with build-only information like run_exports definitions. I'm confused.

@msarahan
Contributor

Channeldata.json was something that @kalefranz brought in a long, long time ago, when he refactored conda-build's indexing. It wasn't really used for anything. I started to use it for run_exports as an optimization.

How would you be able to use the run_exports information correctly if you don't have the full build information (channel, subdir, version, (build number,) build string) available?

  • channel is implicit - which url is channeldata.json coming from?
  • subdir is (almost always) implicit, because almost all packages that have hashes have them because a compiler is used. Compilers are specific to platforms, since the compiler name always has the target platform as a suffix. This means the hash itself encodes the subdir. This is probably good enough in practice, but it is not 100% guaranteed that if you have a hash, you had a compiler. So OK, maybe we need subdir.
  • version: of course
  • build number: I hope not
  • build string: yeah, this one is important.

The idea of channeldata.json was to aggregate central stuff from all the packages in one relatively condensed file. Maybe it can still serve this purpose with deeper indexes for run_exports. I'm ambivalent.

@kalefranz
Contributor

It wasn't really used for anything.

Actually that's not true. The history predates me. It goes back to Ilan's metadata.json file that was used by Black Duck. For many reasons it's useful to have some form of high-level index of the channel, not just the individual subdirs (i.e. repodata.json). The metadata.json file format has some structural issues, so when Gonzalo wanted to start using it for Navigator, we worked together on the improved format. The file is now used by Navigator. It's also likely used by VSCode at this point, since they had a request in a while back to give conda the ability to interact with channel-wide information rather than just what's in repodata. My guess is that PyCharm might want to do something similar.

@minrk
Contributor

minrk commented Jul 25, 2019

What would it mean to exclude build number? run_exports can change from one build to the next, so it seems like run_exports has to be keyed by the full package id, including build number.

Take the not-entirely-realistic-but-valid meta.yaml:

package:
  name: has-exports
build:
  number: 1
  run_exports:
      - {{ pin_compatible('libsodium', max_pin='x.x.x') }}

requirements:
  build:
    - {{ compiler('c') }}
  host:
    - libsodium
  • build 1 has libsodium[>=1.0.17,<1.0.18a0]
  • some time later, rebuild with build 2 after there's been a libsodium release. hash input has not changed, but run_exports is now libsodium[>=1.0.18,<1.0.19a0]
  • build a downstream package, depending on has-exports and the previous libsodium==1.0.17

The downstream package should pull in build 1, since it's incompatible with build 2 due to version pinning. If it gets run_exports from build 2, downstream will have mutually exclusive run dependencies on exactly 1.0.17 and 1.0.18.

So it seems like run_exports must include the build number, or the hash inputs need to be expanded so these two builds wouldn't share the same hash.
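To spell out the conflict in that scenario (a minimal, purely illustrative Python check; the version tuples stand in for the pins in the bullets above, and this is not conda's solver):

def satisfies_both(version):
    # Direct dependency from the downstream recipe: libsodium ==1.0.17
    exact = version == (1, 0, 17)
    # Stale run_exports pulled from build 2: libsodium >=1.0.18,<1.0.19a0
    export = (1, 0, 18) <= version < (1, 0, 19)
    return exact and export

# No libsodium version can satisfy both constraints, so the downstream
# package's run requirements are unsolvable.
assert not any(satisfies_both(v) for v in [(1, 0, 17), (1, 0, 18), (1, 0, 19)])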

@jakirkham
Member

Also ran into this recently.

@jjhelmus
Contributor Author

@minrk I agree, run_exports can change between builds, hashes, variants, and likely along other dimensions none of us have considered.

When building a conda package, the run_exports information for the packages in the build and host environments will always be available from the packages themselves, so there is no reason not to use it.

When rendering but not building a recipe, there are performance improvements if the packages themselves do not need to be downloaded. This information is available in channeldata but is incomplete in cases where run_exports differ between variants, builds, etc. Still, if you know the channels you are using have consistent run_exports, or your use case can tolerate small inconsistencies in rendering, this trade-off is often acceptable.

For example, at Anaconda we have some workflows which determine package build order by rendering recipes. For this use case, using channeldata is fine: the majority (maybe all?) of the packages in defaults have consistent run_exports, and if they are incorrect during the builds, the correct run_exports from the packages themselves are used.

A PR adding the necessary information to channeldata to account for builds, hashes, etc. would be welcome but is not a priority for the Anaconda team.

@github-actions

Hi there, thank you for your contribution!

This issue has been automatically marked as stale because it has not had recent activity. It will be closed automatically if no further activity occurs.

If you would like this issue to remain open please:

  1. Verify that you can still reproduce the issue at hand
  2. Comment that the issue is still reproducible and include:
    - What OS and version you reproduced the issue on
    - What steps you followed to reproduce the issue

NOTE: If this issue was closed prematurely, please leave a comment.

Thanks!

github-actions bot added the stale label Feb 10, 2023
github-actions bot added the stale::closed label Mar 13, 2023
@jakirkham
Member

Could we please reopen this?

github-actions bot added the locked label Mar 15, 2024
github-actions bot locked as resolved and limited conversation to collaborators Mar 15, 2024